DOCUMENT RESUME

ED 050 773                                   LI 002 835

AUTHOR          Resnikoff, H. L.; Dolby, J. L.
TITLE           Access: A Study of Information Storage and Retrieval
                with Emphasis on Library Information Systems.
                Interim Report.
INSTITUTION     R and D Consultants Co., Los Altos, Calif.
SPONS AGENCY    Office of Education (DHEW), Washington, D.C. Bureau
                of Research.
BUREAU NO       BR-8-0548
PUB DATE        21 May 71
CONTRACT        OEC-0-9-140548-2791(095)
NOTE            225p.

EDRS PRICE      EDRS Price MF-$0.65 HC-$9.87
DESCRIPTORS     *Archives, Books, Indexes (Locaters), *Information
                Retrieval, *Information Storage, Information
                Systems, *Library Collections, *Library Materials

ABSTRACT

Chapter I: "Introduction and Summary of Results,"
stresses the view that the problem of insufficient access is
primarily a problem of the great size of the archives to which access
is desired. Chapter II: "Levels of Information Storage and Access,"
is directed toward the problems of library archives and in this
context it is access to the content of books and collections of books
that is of immediate concern. Chapter III: "Mathematics of
Information Distributions," is devoted to the mathematical study of
some of the distributions that arise naturally in the study of
information systems. Chapter IV: "The Structure of the
Back-of-the-Book Indexes," is a study of indexes to books in order to
determine what structure, if any, they possess. Chapter V:
"Algorithmic Text Indexing," is also exclusively concerned with
back-of-the-book indexes. Chapter VI: "Amalgamative Access
Mechanisms," looks at the problem of discovering possible methods for
accessing books. Examples of indexes are appended and the tables
included are listed. (NH)






U.S. DEPARTMENT OF HEALTH, EDUCATION
& WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM THE PERSON OR
ORGANIZATION ORIGINATING IT. POINTS OF
VIEW OR OPINIONS STATED DO NOT NECESSARILY
REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.



INTERIM REPORT 
PROJECT NO. 8-0548 



CONTRACT NO. OEC-0-9-140548-2791(095)



ACCESS

A STUDY OF INFORMATION STORAGE AND
RETRIEVAL WITH EMPHASIS ON LIBRARY
INFORMATION SYSTEMS



H. L. RESNIKOFF and J. L. DOLBY
R & D CONSULTANTS COMPANY

Los Altos, California and Houston, Texas 



21 May 1971 







The research reported herein was performed pursuant 
to a contract with the Office of Education, U. S. 
Department of Health, Education, and Welfare. Contractors
undertaking such projects under Government
sponsorship are encouraged to express freely their 
professional judgement in the conduct of the project. 
Points of view or opinions stated do not, therefore, 
necessarily represent official Office of Education 
position or policy. 



U. S. DEPARTMENT OF 

HEALTH, EDUCATION, AND WELFARE 

Office of Education 
Bureau of Research 






ACKNOWLEDGMENT



The authors wish to express their appreciation to
Richard O'Keefe and other members of the library
staff of the Fondren Library, Rice University, for
their generous help and cooperation in the selection
of the Fondren Index Sample, which provides the central
data base of this study. We are also indebted
to the late Gerald Mitchell of the Institute for
Defense Analyses, who aided us in the preparation of
the distribution of digraphs; to The Conference
Board of the Mathematical Sciences, and particularly
the NISIMS Committee, who supported those aspects of
this work particularly concerned with accessing
mathematical archives; to John W. Tukey, the Statistical
Research Techniques Group of Princeton University,
and the National Science Foundation, who supported
the work on algorithmic indexing and made available
for this study preliminary output from their permuted
title listings of the retrospective file of statistical
papers; and to M. Puri, Department of Mathematics,
Indiana University, for his thoughtful contributions
to the study of the mathematical models of access
systems.

Finally, we should like to acknowledge the contributions
of the staff of R & D Consultants Company: William E.
Houchin, particularly for his work on the information
theoretic aspects of the problem; Vcd Forsyth for her
invaluable contributions to the overall data handling
problems; and Joan Resnikoff and Rena Wells for
their painstaking efforts in analysing in fine detail
the index structure of the Fondren Index Sample.






TABLE OF CONTENTS



CHAPTER I      INTRODUCTION AND SUMMARY OF RESULTS . . . . . .   1

CHAPTER II     LEVELS OF INFORMATION STORAGE AND ACCESS  . . .  18

CHAPTER III    MATHEMATICS OF INFORMATION DISTRIBUTIONS  . . .  54

CHAPTER IV     STRUCTURE OF BACK OF THE BOOK INDEXES . . . . .  85

CHAPTER V      ALGORITHMIC TEXT INDEXING . . . . . . . . . . . 112

CHAPTER VI     AMALGAMATIVE ACCESS MECHANISMS  . . . . . . . . 140



APPENDICES

APPENDIX I     ABSTRACT INDEX ENTRIES: A UNIFORM
               SAMPLE FROM THE FONDREN INDEX SAMPLE

APPENDIX II    INDEX PAGE REFERENCE DISTRIBUTIONS
               FROM THE FONDREN INDEX SAMPLE

APPENDIX III   AMALGAMATED ALGORITHMIC INDEX TO
               ABSTRACTS IN STATISTICS

APPENDIX IV    AMALGAMATED ALGORITHMIC INDEX TO
               ABSTRACTS IN CANCER RESEARCH



LIST OF TABLES

Abstract Index Entries from Computerized
Library Catalogs . . . . . . . . . . . . . . . . . . .  15

Distribution of Entry Length in Characters
(excluding page references)  . . . . . . . . . . . . .  27

Size in Characters of Various Bibliographic
Units  . . . . . . . . . . . . . . . . . . . . . . . .  38

Lognormal Standard Deviations  . . . . . . . . . . . .  40

Grouped Number of Index Entries for
Monographs . . . . . . . . . . . . . . . . . . . . . .  75

Distribution of Lexed Words from the Shorter
Oxford Dictionary by Number of Vowel Strings . . . . .  79

Fondren Sample: Fraction of Sample Items
Containing an Index, by LC Letter Class  . . . . . . .  90

Index Access by LC Class . . . . . . . . . . . . . . .  91

Frequency of Index Entries for Items in the
Fondren Index Sample . . . . . . . . . . . . . . . . .  92

Zipf-Mandelbrot Exponent for Index Location
Distribution . . . . . . . . . . . . . . . . . . . . . 100

Comparison of High-Frequency Index Entries
with LC Subject Headings & Titles  . . . . . . . . . . 104

Distribution of Index Entries by Word Length,
Subsample of the Fondren Index Sample  . . . . . . . . 109

Short List of Stop Words Arranged by Word
Length . . . . . . . . . . . . . . . . . . . . . . . . 117

Excluded Index Terms Referring to One Location . . . . 125

Index Entry Length Distribution, Computerized
Library Catalogs . . . . . . . . . . . . . . . . . . . 132

Index Page Location Distribution, Computerized
Library Catalogs . . . . . . . . . . . . . . . . . . . 135

Entry Length Distribution, Algorithmic Index
to Statistical Abstracts . . . . . . . . . . . . . . . 150

Abstract Number Location Distribution,
Algorithmic Index to Statistical Abstracts . . . . . . 151

Bibliographic Description of the Statistics
Index Sample . . . . . . . . . . . . . . . . . . . . . 154

Abstract Entries for the Amalgamated Statistics
Index Sample . . . . . . . . . . . . . . . . . . . . . 156








LIST OF FIGURES

Number of Serial Publications with Trend Line  . . . .   3

Exponential Growth of Scientific Journals
and Abstract Journals  . . . . . . . . . . . . . . . .   5

Index Reference Distributions  . . . . . . . . . . . .  10

High Frequency Index References  . . . . . . . . . . .  12

Text and Page Location Distributions . . . . . . . . .  14

Title Length Distribution in Characters,
Fondren Index Sample . . . . . . . . . . . . . . . . .  22

Distribution of Size of Table of Contents
for Books in the Fondren Index Sample  . . . . . . . .  23

Index Length Distribution, Fondren Index Sample  . . .  25

Distribution of Entry Length in Characters,
Statistical Index Sample . . . . . . . . . . . . . . .  28

Distribution of Book Length in Pages for Books
with Indexes from the Fondren Sample . . . . . . . . .  31

Distribution of Size of University Libraries . . . . .  34

The Level Structure of Access Systems  . . . . . . . .  35

Size of Two-Year College Libraries . . . . . . . . . .  37

Distribution of Size of Bibliographic Units  . . . . .  42

Some Access Distributions in Logarithmic
Variables  . . . . . . . . . . . . . . . . . . . . . .  43

Distribution of Number of Characters per
LC Subject Heading . . . . . . . . . . . . . . . . . .  46

Distribution of Number of Subject Headings in
Fondren Sample (excluding Serials) . . . . . . . . . .  46

ALTEXT Macros Ranked by Number of Assembly
Language Instructions  . . . . . . . . . . . . . . . .  49

Word Length Distributions (in Characters)  . . . . . .  50

Standard Curve of English and Latin Words  . . . . . .  56

Distribution of Most Frequent Ordered Pairs
of English Words . . . . . . . . . . . . . . . . . . .  66

Index Entry Distribution from the Fondren
Index Sample . . . . . . . . . . . . . . . . . . . . .  67

Page Distribution from the Fondren Sample  . . . . . .  69

Coefficient of Skewness vs. Coefficient of
Kurtosity for Lognormal Functions  . . . . . . . . . .  74

Shorter Oxford Dictionary  . . . . . . . . . . . . . .  78

Distribution of Income . . . . . . . . . . . . . . . .  81

Distribution of Index Length by Number of
Index Entries, Fondren Index Sample  . . . . . . . . .  94

Number of Index Entries vs. Number of Page
References . . . . . . . . . . . . . . . . . . . . . .  97

Distribution of Zipf-Mandelbrot Slopes,
Subsample of the Fondren Index Sample  . . . . . . . . 101

Distribution of Index Entries by Word Length,
Subsample of the Fondren Index Sample  . . . . . . . . 110

Word Frequency vs. Rank, Brown University
Standard Corpus of American English  . . . . . . . . . 115

Algorithmic Index to Computerized Library
Catalogs . . . . . . . . . . . . . . . . . . . . . . . 121

Index Entry Length Distribution from the
Algorithmic Index to Computerized Library
Catalogs . . . . . . . . . . . . . . . . . . . . . . . 134

Index Page Location Distribution from the
Index to Computerized Library Catalogs . . . . . . . . 136

Permuted Title Index Page (Left Side)  . . . . . . . . 144

Permuted Title Index Page (Right Side) . . . . . . . . 145

Abstract and Abstract Index  . . . . . . . . . . . . . 146

Cumulative Index to 50 Abstracts (one page)  . . . . . 147

Entry Length Distribution in Words, Index
to Statistical Abstracts . . . . . . . . . . . . . . . 150

Abstract Number Location Distribution,
Index to Statistical Abstracts . . . . . . . . . . . . 151

Rank-Frequency of Reference Distribution,
Statistical Index Sample . . . . . . . . . . . . . . . 155

Accumulated Index to Books in Statistics
(one page) . . . . . . . . . . . . . . . . . . . . . . 158







CHAPTER I 



INTRODUCTION 

AND 

SUMMARY OF RESULTS 









This monograph describes work performed by R & D 
Consultants Company during the first twenty-six months 
of contract #OEC-0-9-140548-2791(095) with the Office
of Education of the Department of Health, Education,
and Welfare. The contract is titled "A computer-aided
study of access management and collection management
in libraries"; its principal objectives are the
development of a model for information access and 
storage systems, and the study of the structure of 
existing access systems with the intent of augmenting 
them in significantly useful ways by means of auto- 
mated processing of machinable data bases. 

The concern which underlies this and many other projects 
is that the rapidly growing body of information stored 
in library archives is overwhelming the traditional 
means of obtaining access to it in a reliable, timely, 
and comprehensive manner. 

In fact, general archival collections have been growing 
in an essentially exponential manner for more than three 
hundred years, and perhaps for much longer. Figure 1.1, 
drawn on semilogarithmic graph paper, illustrates this
phenomenon for serials noted in the Union List through
1930; this collection has been doubling in size every
thirty years. If the trend line is extended back in
time, it suggests the publication of a "first" printed
serial about 1435, which agrees remarkably well with
the invention of printing circa 1440-1456. Although the
weight of this evidence is insufficient to show convincingly
that serial growth has in fact been exponential
since that time, it does support the contention that
the exponential growth of archives is a fundamentally
long-term property, undoubtedly secondary only to
economic and population growth and determined by them.
Consequently, it must be anticipated that archival
growth will, for the foreseeable future, continue to be
exponential, apart perhaps from fluctuations of minor
duration, because there does not yet appear to be a
significant slackening in either population or long-term
economic growth for the world as a whole.
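
The trend-line argument above can be made concrete with a small
calculation. The following is only a sketch: it assumes an idealized
pure exponential that starts from a single serial in 1435 and doubles
every thirty years, which are the two figures quoted in the text. The
function names are illustrative, and the count it prints for 1930 is
what the idealized line implies, not a figure taken from the Union
List.

```python
import math

DOUBLING_YEARS = 30        # doubling period quoted in the text
FIRST_SERIAL_YEAR = 1435   # extrapolated date of the "first" serial


def serials_in(year: float) -> float:
    """Serial count implied by pure exponential growth from one serial."""
    return 2.0 ** ((year - FIRST_SERIAL_YEAR) / DOUBLING_YEARS)


def year_reaching(count: float) -> float:
    """Inverse of serials_in: the year at which the trend line hits count."""
    return FIRST_SERIAL_YEAR + DOUBLING_YEARS * math.log2(count)


if __name__ == "__main__":
    print("implied serials in 1930:", round(serials_in(1930)))
    print("trend line reaches one serial in:", round(year_reaching(1)))
```

Reading the line backwards with year_reaching(1) simply recovers the
assumed origin; the point of the sketch is that a thirty-year doubling
period and a start near the invention of printing are mutually
consistent with a collection of the order of 10^5 serials by 1930.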







Some insight into why this "information explosion" has
received particular emphasis in recent years can be
gained from a study of Figure 1.2, adapted from de Solla
Price (1), which shows that in addition to the exponential
growth of the number of scientific journals since the
second half of the seventeenth century, the number of
scientific abstract journals has also been growing
exponentially, and at the same rate, since their
introduction in about 1825. The abstract journals provide
access to the larger body of primary scientific journals;
the figure shows that the need for this secondary form
of publication apparently appeared when the number of
scientific journals reached 300. The number of abstract
journals reached 300 by 1950, making it as difficult
to access the abstract journals as it had been to
access the primary archive in 1825. This suggests that
one of the reasons for the current serious concern
about problems of information storage and retrieval
is that it is once again necessary to invent an
appropriate form of (tertiary) publication which will permit
another period of orderly growth of the archive.

If the historical and current trend continues unabated
for another fifteen years, there will be about 500,000
different scientific journals in existence, publishing
more than 25 million papers each year; similar quantities
of information will be spewed forth by other fields of
endeavour. It is clearly not the problem of storing
this information that makes the prospect of such prolific
productivity terrifying; current microform techniques
are already sufficient to reduce the physical storage
requirements to much less than that presently required
to store the current production of journals published
in conventional form. Moreover, standardization of
microform stores makes it possible to implement physical
retrieval systems that are faster and cheaper than present
typical library storage techniques. Nor is the prospect
of having to read all of the published material the
significant problem. No scientist since 1800 has had
the time to read "all" of the papers published, even
had he the inclination to do so; the situation is the
same in most other fields. The inevitable fact that the
fraction of published papers read by an individual
is going to drop a few more orders of magnitude is
hardly consequential.

The problem posed by the explosion of information is 
only overwhelming when the difficulty of finding 
a particular fact or result in the vast sea of information 
is considered. It is this problem of access to which 
the work reported here is addressed . 



Figure 1.2

Exponential Growth of Scientific
Journals and Abstract Journals

[Figure not reproduced. Vertical axis: Number of Journals. Curves
labeled FIRST ORDER ACCESS SYSTEM and SECOND ORDER ACCESS SYSTEM.
Number of Scientific Periodicals (data from D. J. de Solla Price,
Science since Babylon [New Haven, 1961], p. 97).]



Chapter II introduces a level structured model for
access systems, which can be briefly described here.
Restricting attention to collections of information
expressed in natural languages, size can be reasonably
measured by the number of characters, including
linguistically necessary interword spaces, contained in the
collection. For naturally occurring informational units such as
the book title, table of contents, book index, book,
and, regarding amalgamated information stores, the
university library card catalog and the university
library itself, the average size of each informational
unit is nearly an integral power of a fixed number K
of characters. The value of K is nearly 30. For
example, K ≈ 30 is the average length of a book title
measured in characters (as well as the average length
of an index entry and of the subject heading information
on a Library of Congress catalog card); K^2 = 874
characters is approximately the average size of a table of
contents; K^3 = 25,822 of the average book index, and K^4 =
763,203 of the average book. In each case, the average
length is remarkably close in value to the power of K
in question.
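
As a quick arithmetic check on the figures just quoted, the sketch
below recovers the value of K implied by each average size by taking
the appropriate root. The dictionary name and the helper function are
illustrative only; the three sizes and their assumed powers (2, 3, 4)
are the ones given in the paragraph above.

```python
# Average sizes in characters quoted in the text, with the power of K
# each one is said to approximate.
QUOTED_SIZES = {
    "table of contents": (2, 874),
    "book index": (3, 25_822),
    "book": (4, 763_203),
}


def implied_k(power: int, size: float) -> float:
    """Value of K for which K**power equals the quoted size."""
    return size ** (1.0 / power)


if __name__ == "__main__":
    for unit, (power, size) in QUOTED_SIZES.items():
        print(f"{unit:18s} K^{power} = {size:>7d}  ->  K = {implied_k(power, size):.3f}")
```

All three roots fall in a narrow band just below 30, which is the sense
in which the text says the value of K is nearly 30.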

If the size of an information collection is expressed
as a power of K, say K^x, then it is convenient to define
the level of the collection as the integer closest to
x. With this convention, the level of a book title,
table of contents, book index, and book is, respectively,
1, 2, 3, and 4. A university library is of level 8.
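
The definition of level translates directly into a one-line
computation. In this sketch K is taken as exactly 30, which is an
approximation (the text says only that K is nearly 30), and the
function name is illustrative.

```python
import math

K = 30.0  # assumed round value; the text says K is "nearly 30"


def level(size_in_characters: float) -> int:
    """Integer closest to x, where size = K**x."""
    return round(math.log(size_in_characters) / math.log(K))


if __name__ == "__main__":
    for unit, size in [("title", 30), ("table of contents", 874),
                       ("book index", 25_822), ("book", 763_203)]:
        print(unit, level(size))
```

Because the level is a rounded logarithm, it is insensitive to the
exact choice of K near 30 for the four average sizes quoted above.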

It therefore appears that the traditional means for
retrieving information stored in a book are structured
in levels which are equally spaced when measured by
their level, that is, when measured by the logarithm
of their size.

If one information base is an access system for another,
as a book index is for a book, then the order of
access is defined to be the difference between their
levels. In general, the larger the order, the less
expensive is the access system insofar as its construction
and maintenance are concerned, relative of course to
the cost of obtaining and maintaining the accessed data
base; but the smaller the order, the more effective the
access system will be in locating specific information
and accurately reflecting the content of the accessed
archive. For instance, a title list is less expensive
and less informative than a collection of abstracts
(such as Chemical Abstracts) in specifying the content
of journal articles in chemistry; the former is of
order 2, the latter of order 1.
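
The order of access can be sketched the same way, as a difference of
levels. The example uses the average sizes for a book and its
traditional access systems quoted earlier in this chapter; K = 30 and
both function names are assumptions made for illustration.

```python
import math

K = 30.0  # assumed round value of the level base


def level(size_in_characters: float) -> int:
    """Integer closest to log base K of the size."""
    return round(math.log(size_in_characters) / math.log(K))


def order_of_access(access_size: float, base_size: float) -> int:
    """Level of the accessed base minus level of the access system."""
    return level(base_size) - level(access_size)


if __name__ == "__main__":
    BOOK = 763_203  # average book size in characters, from the text
    for name, size in [("index", 25_822), ("table of contents", 874),
                       ("title", 30)]:
        print(name, "is of order", order_of_access(size, BOOK))
```

A larger order means a smaller, cheaper access system; a smaller order
means one that reflects the accessed content more closely.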






The level structure described above will provide a
valuable management tool for determining, amongst other
things, the reasonable size and cost for a system
designed to access a given information base only if the
average size of a class of information bases is typical
of the distribution of sizes in that class. That this
is indeed the case is strongly attested by extensive
data sampling studies presented in Chapter II, including
the analysis of more than 500,000 index entries occurring
in a random sample of books drawn from a medium size
university library. All of the evidence reinforces
the hypothesis that the distribution of size
of information collections belonging to a class is
lognormal; that is, the distribution of the logarithm
of the size of the informational units belonging to
the class is a normal distribution. Each access level
corresponds to a different lognormal distribution. It
turns out that the variances of the occurring
distributions are all nearly the same throughout the entire
range from level 1 (titles) to level 8 (university
libraries); this means that the distributions depend
essentially only on their mean and are therefore
characterized by their level. This justifies the use of the
notion of level as a measure of an access system.
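
To illustrate what the lognormal hypothesis means in practice, the
sketch below draws synthetic "book sizes" whose logarithms are normally
distributed and then summarizes them in units of level. Everything
numerical here is hypothetical: the spread of 0.25 level, the sample
size, the seed, and K = 30 are chosen only for illustration and are not
values reported in the study.

```python
import math
import random
import statistics

K = 30.0  # assumed round value of the level base


def synthetic_sizes(mean_level: float, sigma_levels: float, n: int,
                    seed: int = 0) -> list:
    """Draw n sizes whose log base K is normal(mean_level, sigma_levels)."""
    rng = random.Random(seed)
    mu = mean_level * math.log(K)        # mean of the natural logarithm
    sigma = sigma_levels * math.log(K)   # std. dev. of the natural logarithm
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def log_size_summary(sizes) -> tuple:
    """Mean and standard deviation of log base K of the sizes."""
    logs = [math.log(s, K) for s in sizes]
    return statistics.mean(logs), statistics.stdev(logs)


if __name__ == "__main__":
    sizes = synthetic_sizes(mean_level=4, sigma_levels=0.25, n=5000)
    mean_log, sd_log = log_size_summary(sizes)
    print(f"mean level {mean_log:.3f}, spread {sd_log:.3f}")
```

If the spread is about the same for every class, as the text reports,
then the mean of the logarithm alone (the level) identifies the
distribution.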

The principal objective of Chapter III is to show that
the lognormal distribution of size of informational
units belonging to a class (e.g., titles, books, libraries)
is a mathematical consequence of certain reasonable
assumptions concerning the "effort" or "cost" of using
an item in an access system if the complete system
maximizes the output of information per unit effort expended.
Our argument is a minor extension of Mandelbrot's
derivation of the generalized Zipf-Bradford distribution;
cp. Refs. (2), (3). The remainder of the chapter describes
general mathematical properties of lognormal distributions
with emphasis on the most convenient but nevertheless
laborious and not entirely satisfactory technique for
fitting lognormal functions to sample data; a number of
worked examples which are of independent interest
are included.

Unfortunately we do not know a theoretical argument that 
will produce the equispacing of the level structure 
of the means of the lognormal distributions associated 
with an access system; this aspect of the access model 
rests entirely on observational evidence . 

The book is still the most natural informational unit
for those concerned with library matters. According to
the access model, there are exactly four orders of access
associated with collections of information of this size,
and, as we have already remarked, there is a traditional
access system operating at each of these levels:
the index is of order 1, the table of contents of
order 2, the title of order 3, and finally, the
Library of Congress letter class, which partitions the
entire span of written human knowledge into 21 grand
categories, is of order 4.

Although these access levels are, in accordance with
the prescription of the model, the only ones possible,
there are of course many different types of access
systems which can function at each of these levels
in addition to those just named. For example, a
nine page review of a 277 page book provides typical order
1 access for the book of average size. Order 1 access
systems most accurately reflect the content of the
information collection they access and can moreover
form the subsidiary information base from which access
systems of higher order (i.e., lower level) can be
constructed. Because this procedure obviously cannot
be reversed (a low order access system can never be
constructed from one of higher order), order 1 access
systems deserve special study.

Of the traditional order 1 access systems, the book 
index is the most amenable to extensive statistical 
analysis, both because it is found in close proximity 
to the book text to which it refers (which is generally 
not the case for book reviews) and because it is naturally 
composed of a large number of homologous small entities 
which are suitably arranged for analytical study . 

We have investigated three major collections of book
indexes. The first contains more than 100 books drawn
from the present authors' libraries; although this sample
exhibits some variation in subject matter, science and
more particularly history, mathematics, and physics are
heavily weighted. The second sample consists of 80
current books in statistics and probability theory
and comprises what can be thought of as a specialist's
hand library; it undoubtedly accurately reflects the
nature of indexes to books in these fields. All index
entries in this sample were committed to machine readable
form to permit ready reorganization and analysis of the
collection of index entries, of which there were 31,232.

Study of these collections was instrumental in guiding
us to the formulation of the access model presented in
Chapter II, but their limitations (principally their
restriction to few subject areas and the undoubtedly
biased method of their selection, but also their
relatively small size) clearly indicated the desirability
of carefully selecting a random sample of books and
their indexes from a broadly representative archive.

The third index sample consists of such a random selection 
of 706 indexes from the Fondren Library at Rice University. 
For each book in this sample, copies of the shelf 
list catalog card, title page, table of contents, and 
index were made. From this information it is possible 
to determine the size of the book (in pages, which can
then be approximately adjusted to equivalent number
of characters) and the precise number of characters in
the title, table of contents, and index, which are the 
three significant traditional book access systems that 
are normally packaged with the book itself. 

Chapter IV describes the structure and properties of 
traditional back of the book indexes based on a study 
of these three samples. There are three main
conclusions. First, the average number of index entries per
index is determined; the result is 836, with relatively
little variation throughout the different Library of
Congress letter classes. Second, it is shown that
the distribution of the number of books as a function 
of the number of entries in their index is lognormal, 
providing further support for the access model derived in 
Chapter II. The remainder of Chapter IV is devoted to 
a study of the distribution of the number of text 
references per index term in a given book. The under- 
lying idea is an outgrowth of the simple observation 
that those index entries that refer to only one text 
page cannot typify the general content of the book, 
whereas an entry that refers the reader to 40 or 50 
text pages is truly of little specific utility to the 
reader except insofar as it points out one general topic 
of the book. It is therefore conceivable that some 
subset of the index functions as a collection of "key 
words", specifying the semantic content of the work and 
serving little further purpose. Were it possible to
separate this subset from the other more numerous index
entries, the way would be clear for automatic descriptor
determination based on a machine readable index; moreover,
if the process of constructing the index itself could 
be automated, iteration of these processes would lead 
to the descriptors as well and quite possibly to a 
successful method for man-machine interactive content 
classification . 

Figure 1.3 exhibits the page reference distribution
for two books: LB875.C7 was published in 1922, is titled
Two Views of Education, and contains 775 index entries,
whereas DS423.C85 v4 is the fourth volume of The Cultural
Heritage of India, published in the interval 1953-58, and
containing 4906 index entries. It is clear from the figure,
which is drawn on full logarithmic graph paper, that
except for the quite small numbers of entries referring
to very many pages, both distributions are linear on
the graph paper, and hence the number of index entries
is a power function of the number of page references.

[Figure 1.3: Index Reference Distributions. Figure not reproduced.
Full logarithmic plot; horizontal axis: NUMBER OF PAGE REFERENCES.]

From the theoretical considerations in Chapter III 
one is led to suspect that these graphs ought perhaps to 
represent lognormal functions, which appear on logarithmic 
graph paper as parabolas. As we show in the third
chapter, the power function, represented by a straight
line, is a degenerate form of the parabola representing
the lognormal. Other book indexes, as for instance that of
The 1969 World Almanac, illustrated at the left half of
Figure 1.4, do indeed flaunt a tell-tale curvature and
can be accurately fitted by a parabola. This is the
third major result of Chapter IV: the page reference
distribution of index terms is, generally, a lognormal
function which may degenerate into a power function.
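
One way to see why a lognormal function plots as a parabola on full
logarithmic paper, and why a straight line is its degenerate form, is
to look at the logarithm of the lognormal density as a function of
u = ln x. It is a quadratic in u whose curvature is -1/sigma^2, so its
second difference over equally spaced points is constant and shrinks
toward zero as sigma grows. The sketch below checks this numerically;
it uses the standard lognormal density, and the parameter values are
arbitrary illustrations rather than values fitted in the study.

```python
import math


def log_lognormal(u: float, mu: float, sigma: float) -> float:
    """Natural log of the lognormal density evaluated at x = exp(u)."""
    return (-u
            - math.log(sigma * math.sqrt(2.0 * math.pi))
            - (u - mu) ** 2 / (2.0 * sigma ** 2))


def second_difference(mu: float, sigma: float, u: float, h: float) -> float:
    """Discrete curvature of the log-log curve at u with spacing h."""
    return (log_lognormal(u + h, mu, sigma)
            - 2.0 * log_lognormal(u, mu, sigma)
            + log_lognormal(u - h, mu, sigma))


if __name__ == "__main__":
    h = 0.5
    for sigma in (0.5, 2.0, 10.0):
        # For a parabola the value is the same at every u: -h**2 / sigma**2.
        print(sigma, second_difference(mu=1.0, sigma=sigma, u=3.0, h=h))
```

For large sigma the curvature is so small that, over the limited range
of page references found in a single index, the parabola is hard to
tell from a straight line, that is, from a power function.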

There are about 6600 index entries in The 1969 World
Almanac. This book and others like it are more thoroughly
and densely indexed than most, but let us for the moment
treat this index as an order 1 access system for its
text, as usual, but simultaneously consider it as an
information base requiring access systems. Then selection
of that 1/30 of the index which refers to the largest
number of page locations will produce an order 2 access
system for the original text, which will be of the size
of a table of contents. Repeating this operation leads
to the selection of the subset of the index which is about
1/(30)^2 = 1/900 the size of the index and approximately
the size of a title. This process produces about 8
index entries, substantially larger than a title because
of the peculiarities of almanacs. Figure 1.4 shows the
four most "popular" index entries; in approximately
the space of the title they provide an order 3 precis
of the book's content which is a not unuseful alternative
to the title itself.
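
The repeated 1/30 selection described above can be sketched as a
ranking by number of page references. This is an illustration under
stated assumptions, not the authors' procedure: the index is
represented as a hypothetical mapping from entry to page-reference
count, K is taken as 30, and the count is rounded up, a choice made
here only because it reproduces the "about 8" quoted for a 6600-entry
index.

```python
import math

K = 30.0  # assumed round value of the level base


def keep_count(index_size: int, steps: int) -> int:
    """Number of entries left after `steps` successive 1/K selections."""
    return max(1, math.ceil(index_size / K ** steps))


def most_referenced(index: dict, steps: int = 1) -> list:
    """Entries with the most page references, sized for the given step.

    `index` maps an index entry to its number of page references.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(index.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:keep_count(len(index), steps)]


if __name__ == "__main__":
    print(keep_count(6600, 1))   # size of the order 2 selection
    print(keep_count(6600, 2))   # size of the order 3 selection
    toy = {f"term{i:03d}": (i % 50) + 1 for i in range(900)}  # made-up data
    print(most_referenced(toy, steps=2))
```

One step yields a selection of roughly table-of-contents size, and two
steps yield a handful of entries comparable in size to a title.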

The distribution for Nader's Unsafe at any Speed is shown
in the same figure; the three most popular index entries
again provide a cogently descriptive view of the book's
content which is in fact not provided at all by the title.

Chapter IV pursues the study of the effectiveness of the
popular index entries as content descriptors through
the analysis of a uniform subsample of the Index Sample;
for this subsample those index entries which refer to
large numbers of text locations have been explicitly
listed. The subsample is included as Appendix I.



[Figure 1.4: High Frequency Index References. Figure not reproduced.
Full logarithmic plots; horizontal axis: NUMBER OF PAGE REFERENCES.]



The problem of automatically indexing documents and
books has intrigued computer buffs for years. Numerous
programs are now available, and many learned research
papers have been written describing them, how modest
their demands are on the machines that implement them,
and how effective they are in satisfying rather
imprecisely stated hypothetical requirements of potential
users. But as far as we have been able to learn, no
commercial or professional publishing house uses machines
to index books or papers. The reason for this
consists of a complex of subreasons, not all of which
have to do with the adequacy of machine methods, but
it is certainly true that the general complexity,
inflexibility, and simple inadequacy of these programs
have acted as strong deterrents to their use. Although
the problem is hardly a trivial one, we think that one of
the most significant factors hindering the development
of indexing algorithms that will rival and surpass
human performance is that no one has ever attempted
to assess precisely what properties human produced
indexes actually have, as opposed to what indexers and
students of indexing believe ought to be the properties
of indexes. The availability of the Fondren Index
Sample has made it possible to assess human performance
in this area, and to set standards for the performance
of machine methods of indexing which are objective and,
insofar as they refer to the structural statistics
of indexes rather than their semantic content, also
measurable. Elucidation of these common structural
characteristics of human indexes has in turn suggested
some new approaches to the problem of machine indexing.
Chapter V is devoted to one such new method. Figure 1.5
illustrates the text location and page location reference
distributions for the algorithmically produced index
to Computerized Library Catalogs: Their Growth, Cost
and Utility. Based on our study of the Fondren Index
Sample we can assert that this algorithmic index is
the "right size"; moreover, it is evident from the
figure that the reference distributions agree well
with typical distributions associated with human indexes.
The details of the algorithm as well as of the index
referred to by the figure are the subject of the fifth
chapter.

Combining these results with those of the previous
chapter leads to a new method for obtaining keyword
descriptors, which is discussed in the context of the
particular algorithmic index exhibited. Indeed, this
index consists of 340 entries; an order 1 access system
acting on this index should select about 12 entries
which would provide an "abstract" of the content of the
index. There are 9 entries referring to at least ten
page locations, but 14 referring to at least nine. Table
1.1 lists these 14 abstract entries together with the
number of pages to which each refers.

[Figure 1.5: Text and Page Location Distributions. Figure not
reproduced; axis labels are not recoverable.]



Table 1.1

ABSTRACT INDEX ENTRIES
FROM
COMPUTERIZED LIBRARY CATALOGS: . . .

No. of Page
References      Index Entry

    16          LC
    16          GNP
    14          growth rate
    14          library catalog
    14          machine-readable form
    13          Library of Congress
    12          gross national product
    11          university library
    10          exponential growth
     9          bibliographic record
     9          Fondren, see Rice University
     9          Sample, see Fondren Sample,
                Rice University
     9          shelf list
     9          Stanford



In this case the abstract entries provide an accurate
capsule view of the problems studied in that book as
well as a list of the principal sources of information
upon which it bases its arguments.
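
The choice between the two cutoffs mentioned above (ten pages or nine pages) can be sketched in a few lines. This is only an illustration: the rule used here (take the highest page-count threshold that still yields at least the target number of entries) is an assumption, not a procedure stated in the report, and the names `TABLE_1_1_COUNTS` and `choose_threshold` are hypothetical. Only the 14 counts of Table 1.1 are used, since every other entry of the index refers to fewer than nine pages.

```python
# Page-reference counts of the 14 entries listed in Table 1.1.
TABLE_1_1_COUNTS = [16, 16, 14, 14, 14, 13, 12, 11, 10, 9, 9, 9, 9, 9]


def entries_at_or_above(counts, threshold):
    """Number of entries that refer to at least `threshold` pages."""
    return sum(1 for c in counts if c >= threshold)


def choose_threshold(counts, target):
    """Highest page-count threshold that still selects at least `target` entries.

    Hypothetical selection rule, assumed for illustration only.
    """
    for threshold in sorted(set(counts), reverse=True):
        if entries_at_or_above(counts, threshold) >= target:
            return threshold
    return min(counts)  # target exceeds the list: keep everything


threshold = choose_threshold(TABLE_1_1_COUNTS, target=12)
print(threshold, entries_at_or_above(TABLE_1_1_COUNTS, threshold))  # 9 14
```

With a target of about 12 entries, a cutoff of ten pages falls short (9 entries), so the cutoff drops to nine pages and 14 entries are kept.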

Although there are only four levels available for
accessing the text of books and each of these is already
served by a traditional access mechanism, there remain
many possibilities for repackaging access information
in order to serve needs that cannot be met by traditional
means. Many of these are amalgamative in the sense that
they combine access information associated with numerous
comparable unit information stores in a reorganized
manner that permits ready selection of the units that
are likely to contain specific matter desired by the
user. All document information retrieval systems
operate in an amalgamative manner, as does the library
catalog. Most low order (high level) amalgamative
access systems currently in use organize the access
information in a sequential fashion based first on
date of publication and secondarily according to some
scheme of content classification. This is the procedure
used to organize professional society abstracting journals
(which provide order 1 access); its success depends
entirely on the accuracy and excellence of the content
classification system and the classifiers who implement
it. Such systems, which represent professional consensus
concerning significant categorical classifications,
are in general partly obsolete, especially in rapidly
growing fields such as chemistry where the quantity
of published material may double in as few as eleven
years. Moreover, although the bulk of the classified
material remains stable as the classification system
expands and is refined, some fraction of the archival
materials, which is likely to include the most innovative
work, should be reclassified to account for changes in
classification categories and procedures, but, due to
economic constraints, it never is. This difficulty
suggests that it may be desirable to investigate
amalgamative access mechanisms which do not depend on
external classification structures, which are inherently
slow to accommodate themselves to change, but rather rely
on the text terms themselves. Systems based on the
processing of numerous homologous small items such as
index entries, as opposed to text abstracts, are of
special interest because they are less subject to global
grammatical constraints, and therefore admit a greater
variety of potentially useful orderings.

Chapter VI studies two new types of amalgamative access
systems. The first consists of the combined indexes to a
collection of books, here illustrated by the combined
indexes to 80 books in statistics and probability theory,
already mentioned above in another context. The second
is more unusual. We have applied a version of the
algorithmic indexing procedure described in Chapter V
to two samples of 50 abstracts drawn, respectively,
from the Annals of Mathematical Statistics and the
Journal of Cancer Research; the results are exhibited
and analyzed.

Appendix I displays the order 1 abstract entries (but
in some cases involving exceptionally large indexes,
only the order 2 abstract entries) from a uniform
subsample of the Fondren Index Sample.

Appendix II displays the distribution of the number of 
index entries as a function of the number of distinct text 
pages to which each entry refers for the same subsample 
of the Fondren Index Sample used for Appendix I. 

These distributions confirm the assertion that the dis- 
tribution is essentially a power function. 

Appendices III and IV are automatically constructed
amalgamative indexes to 50 abstracts of papers in
statistics and in cancer research respectively. The
algorithm and all internal dictionary-like stores
used by it are the same for both data sources.



REFERENCES 

1. De Solla Price, D. J., Science Since Babylon, Yale
   University Press, New Haven, 1961.

2. Mandelbrot, B., "An Informational Theory of the
   Statistical Structure of Language", Communication
   Theory, Butterworths, London, 1953.

3. Mandelbrot, B., "On the Theory of Word Frequencies
   and on Related Markovian Models of Discourse", in
   "Structure of Language and Its Mathematical
   Aspects", Proc. Symp. Appl. Math. 12 (1961).







CHAPTER II 



LEVELS OF 

INFORMATION STORAGE AND ACCESS 










In the previous chapter we have stressed the view that 
the problem of insufficient access is primarily a 
problem of the great size of the archive to which access 
is desired. This study is directed toward problems of 
library archives and in this context it is access to 
the content of books and collections of books that is of 
immediate concern although libraries are increasingly 
becoming archival depositories of other types of 
information bearing records. 

There are technical reasons that make it desirable to 
restrict attention — at least in a preliminary study 
such as the present one — to the monograph collection; 
we will have some useful remarks to make about serials 
and can also exhibit data supporting the extension of 
the model that will be proposed to describe the serial 
collection . 

The book is a natural halfway house in the hierarchy
of means for storing written information in libraries.
Within the book are usually to be found certain standard
apparatus which aid in directing the user to the internal
location of information with which the book is concerned;
these include, in descending order of size, the index,
the table of contents, and the title. The library
itself is of course a collection of books but it too
contains certain apparatus for directing the user to
those amongst the many books held that contain information
concerning some particular matter; these include, in
increasing order of size, the classification system,
the reference section, and the card catalog. There
are also other types of traditional access means that
aid in locating books which contain certain information,
including special bibliographies and, too often overlooked,
the reference librarians. If indeed size is the
predominant factor determining the need for access, then
a study of the size of the various natural bibliographic
units named above may shed light on the structure, if
any, of the traditional access systems and thereby
also provide guidelines for those who study the possible
ways for increasing and automating the means of access.






We will proceed up the scale of size of the naturally
occurring access means associated with books and
collections of books, with the intent of determining the
statistical distribution of size of each such
system; this information will lead in a natural way to
the level structured model of access systems briefly
described in Chapter I.


Initially limiting our attention to the book itself,
there are four systems of interest:

1. Title
2. Table of Contents
3. Index
4. Book Text



In each case we wish to know the mean (average) size
of the item in question, measured, let us say, by the
number of characters (including the interword space)
contained in the item. Moreover, it will turn out to
be important to know the distribution of size for each
case so that it will be possible to say to what
extent the mean is characteristic of the distribution
and also because the distributions will turn out to
have an intrinsic connection with the access problem
via the intervention of the mathematical discipline
known as information theory; this latter aspect of our
study will be described in Chapter III.

It is not easy to obtain reliable statistics about the
size of bibliographic units; it is especially difficult
if general samples that are not restricted to one or
a few fields of interest are desired. We have based
our book studies on the Fondren Sample, a random sample
of 1926 cards drawn from the shelf list catalog of the
Fondren Library at Rice University in 1968; it has been
described in some detail in Ref. (1). Associated with
each shelf list card is one or more monographs; these
monographs constitute the sample on which our study is
based. It is appropriate to refer to it as a random
sample of books from a medium sized university library.

Because we are interested in studying the interaction
of the various traditional access systems used in books
we have extracted from the Fondren Sample all those
books that contain an index (here and throughout all
that follows, index will of course mean back-of-the-book
index), thus yielding what we have called the
Fondren Index Sample, which may reasonably be called a
random sample of indexes. There are of course certain
unavoidable biases present in this index sample: the
Fondren Library does not have an adequate collection
in medicine or law, for instance; it has an exceptionally
fine collection in other areas. But, to the best of
our knowledge, these samples are the closest in existence
to truly random samples of books and of books with indexes
belonging to the complete population of all books ever
published.

With these preliminaries in mind we can now turn to study
the structure of book titles. Figure 2.1 displays the
distribution of the number of characters per book title
for books from the Fondren Index Sample drawn on lognormal
probability graph paper. The mean number of characters
per title is 28.15.

Next consider the size of a table of contents measured 
by the number of characters it contains. 

Although the "structure" of a book title is relatively
standardized, the same cannot be said of the table of
contents. Some books include phrases such as "Chapter 1",
others simply record "1" to designate the first chapter,
and others do not bother to indicate the chapter ordinal
at all. There are tables of contents which include,
in addition to a chapter title, relatively extensive
descriptions of the text content of a narrative nature;
others include section titles. Despite the rather
excessive degree of variation that does occur, there are
certain components of a table of contents which appear
to be nearly invariable in their presence, including
the chapter titles and the page number designating the
beginning of each chapter. We have chosen to define
the table of contents as that portion of the material
contained in what is normally termed the table of
contents that corresponds to the chapter titles, excluding
from consideration all headings, chapter ordinals,
appendices, tables of figures, etc., and page number
referents to the location of chapter initial pages. With
this convention, a random subsample of 161 tables of
contents was selected from the Fondren Index Sample and
the number of characters (including interword space
characters) was counted for each selected table of
contents. It turns out that the mean size of a table
of contents defined in this way is 505 characters.

Figure 2.2 displays the distribution of table of contents 
size for this subsample. 







The reader can hardly help but notice that the data
exhibited in each of Figures 2.1 and 2.2 fall nearly
along a line, and moreover that the two lines have
similar slope. The graph paper is so designed that
straight lines indicate that the data are drawn from a
lognormal distribution, whose properties will be
discussed later on in this chapter and extensively in
Chapter III; it suffices here to stress that thus far
the data indicate that the size distributions of the two
lowest levels of book access systems belong to some
well known family of statistical distributions and
indeed to the same family. We will want to look for
this possibility when examining data referring to other
access systems.
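
The graphical test just described can be imitated numerically. The sketch below is an illustration under stated assumptions, not the procedure used for the figures: it pairs the base 10 logarithm of each sorted size with the standard normal quantile of its rank, using the plotting position (i + 0.5)/n, which is one of several common conventions, and fits a straight line. For lognormal data the points fall on a line whose slope is the standard deviation of log10(size). The data are synthetic and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean


def probability_plot_points(sizes):
    """Pair each sorted size (as log10) with the normal quantile of its rank."""
    ordered = sorted(sizes)
    n = len(ordered)
    normal = NormalDist()
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    return [(normal.inv_cdf((i + 0.5) / n), math.log10(x))
            for i, x in enumerate(ordered)]


def fit_line(points):
    """Ordinary least squares fit y = intercept + slope * z."""
    zs = [z for z, _ in points]
    ys = [y for _, y in points]
    z_bar, y_bar = mean(zs), mean(ys)
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_zy = sum((z - z_bar) * (y - y_bar) for z, y in points)
    slope = s_zy / s_zz
    return slope, y_bar - slope * z_bar


# Synthetic, exactly lognormal data (NOT the Fondren data): log10(size) has
# mean 1.4 and standard deviation 0.2.
_normal = NormalDist()
synthetic = [10 ** (1.4 + 0.2 * _normal.inv_cdf((i + 0.5) / 99))
             for i in range(99)]
slope, intercept = fit_line(probability_plot_points(synthetic))
print(round(slope, 3), round(intercept, 3))  # 0.2 1.4
```

Two samples whose fitted slopes are close correspond to lines of similar slope on the probability paper.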

The index is the next largest access tool traditionally 
found in books , and from many points of view it is the 
most important and responsive to the detailed demands 
of the user. It therefore deserves extensive examination. 

The Fondren Index Sample consists of 706 indexes. 

Chapter IV investigates the relationship of indexed
books to the unindexed books in the Fondren Sample and
studies such properties of the indexed books as their
distribution among the Library of Congress classification
categories. Here we are only interested in
considerations of size. The mean number of index
entries per index is 836.

Figure 2.3 contains the distribution of the number of 
index entries per book, again on lognormal probability 
graph paper. It is evident that the data can be accur- 
ately approximated by a line and furthermore that the 
line has a slope which once again is similar to the slope 
of the lines occurring in the previous two figures. 

One word of caution: here only the number of index
entries is exhibited. Ideally one would wish to
measure the size of an index by the number of characters
it contains, but it would not be feasible to count the
characters in more than half a million index entries.
Furthermore, once again the question of which characters
to count cannot be resolved in a completely unambiguous
way. For instance, it is easy to agree whether page
reference numbers should be counted, and what to do about
consecutive spaces used as separators, but format problems
related to multiple entries grouped under a common
initial phrase, and inverted order entries, demand
operational decisions that are not often guided by a
clear-cut purpose. These problems exist when entries alone
are counted, but they are magnified when characters are
counted. We have agreed, when counting entries, to count
each group of page reference numbers: this defines
the index entries, at least as far as their cardinal
number is concerned, and provides a relatively clear-cut
procedure requiring a minimum amount of subjective
decision by the persons performing the counting. In
order to obtain an approximation to the number of
characters contained in an index, a rather indirect procedure
was used. We have in a convenient form all of the index
entries contained in 80 books in the field of statistics,
all printed in a fixed typefont whose characters are of
constant width, and printed a fixed number of lines to
the page. These characteristics make it possible to
count the number of characters in an entry by measuring
the length of the entry. This was done for a uniform
subsample (comprising about 1.75% of the total
Statistical Index Sample of 31,232 index entries). Table 2.1
lists the number of entries consisting of from 1 to
76 characters, and, opposite 77 characters, the number
of entries that had at least 77 characters. The mean
number of characters per entry, exclusive of page
reference numbers but inclusive of interword spaces, is
25.47. Figure 2.4 displays the distribution of
size of the entries in the Statistical Index Sample.

If we assume that the distribution of size of index
entries is independent of the distribution of the number
of entries per index, then the average number of
characters per index will be the product of the average number
of entries by the average number of characters per
entry. Using the number for the Statistical Index
Sample for the latter, we find that the average number
of characters per index (exclusive of page references)
is 836 × 25.47 = 21,293. If it be assumed that there
are typically three digits and an interword space
required to provide the page reference location
information, then augmenting the average number of characters
per entry by 4 leads to 24,560 characters per index
(inclusive of the page reference approximation).
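
The estimate rests on the fact that the mean of a product of independent quantities is the product of their means. A minimal sketch of the arithmetic follows; the constant names are hypothetical and the inputs are the figures quoted in the text.

```python
MEAN_ENTRIES_PER_INDEX = 836     # Fondren Index Sample
MEAN_CHARS_PER_ENTRY = 25.47     # Statistical Index Sample, page references excluded
PAGE_REFERENCE_ALLOWANCE = 4     # three digits plus one interword space (assumed in the text)

# Under independence, E[entries * chars] = E[entries] * E[chars].
chars_excluding_refs = MEAN_ENTRIES_PER_INDEX * MEAN_CHARS_PER_ENTRY
chars_including_refs = MEAN_ENTRIES_PER_INDEX * (
    MEAN_CHARS_PER_ENTRY + PAGE_REFERENCE_ALLOWANCE)

print(round(chars_excluding_refs))  # 21293
print(round(chars_including_refs))  # 24637
```

The first product reproduces the 21,293 quoted above. The second comes out near 24,637, slightly above the printed figure of 24,560; the report does not say how its figure was rounded or derived.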

The distribution of index entry length for the Statistical 
Index Sample is, again, lognormal to a high degree of 
approximation . 



Table 2.1

STATISTICAL INDEX SAMPLE

Distribution of Entry Length in Characters
(excluding page references)

No. of    No. of       No. of    No. of
Char.     Entries      Char.     Entries

   1         0           41         3
   2         0           42         3
   3         4           43         6
   4         2           44         3
   5         4           45         5
   6         7           46         5
   7         8           47         3
   8         9           48         4
   9        20           49         4
  10        22           50         2
  11        19           51         4
  12        18           52         4
  13        14           53         3
  14        23           54         6
  15        11           55         4
  16        28           56         3
  17        16           57         2
  18         8           --
  19        14           62         1
  20        12           63         1
  21        17           64         1
  22        17           65         1
  23        19           --
  24        19           67         1
  25        14           --
  26         8           69         1
  27        11           --
  28         9           72         1
  29        11           73         1
  30         7
  31        10           75*        1
  32         9           75         2
  33        10          ≥77         9
  34         6
  35         5
  36         4
  37         5
  38        10
  39        10
  40         9











The last of the four natural access tools for monographs
is the monograph text itself. It will be even more
difficult to estimate the size of a book measured by
the number of characters it contains because of the
variability of typefont and page layout, supplemented
by the presence of tabular and figured material.
Although numerous different and justifiable procedures
of making such a size estimate are conceivable, we have
once again attempted to choose a method that would be
simple and insensitive to subjective judgements of the
personnel performing the task in order to improve
accuracy but more importantly to make it possible for
other workers to reproduce (at least nearly) our
results. Regarding book text, there are several levels
of analysis that require an increasing amount of
extraneous and unstandardized information. The simplest
measure, and one that is easily reproduced, is simply
to transcribe the arabic number shown on the catalog
card designating the number of non-front matter
pages. It is difficult to say precisely which pages
are represented by that number in each case, but it is
unnecessary to do so; we simply agree that this number
defines the length of the book in pages. The
distribution of book length measured in pages was determined
in Ref. (1) for the complete Fondren Sample. The mean
number of pages per book is 276.6; the distribution
of pages is however not lognormal, as is readily seen
in Figure 2.5. If the corresponding distribution is
plotted just for those books that do have indexes (i.e.,
for the Fondren Index Sample), the graph in Figure 2.6
results, which shows that the distribution of size of
these books is lognormal. This suggests that there
may be some intrinsic structural difference between
books which contain an index and those that do not.
If attention is restricted to the Fondren Index Sample,
it turns out that the mean number of pages per book
is significantly greater, namely 341.5. The next step
in determining the number of characters per book is to
find the number of lines per page and their length;
this has been studied by Dolby and Jones (Ref. (2)),
who found 38 lines of 24 picas as the mean. The final
step in obtaining an estimate of book size in characters
is to approximate the number of characters per 24
pica line of print; we have analyzed a sample of printed
matter and find 63 characters per 24 pica line as the
mean. These estimates together imply that an average page
of printed text contains 2394 characters, including
interword and end of line spaces. Hereafter it will
be assumed that there are 2400 characters per page. We
have no idea what the effect of tabular and figured
material as well as other formatting conventions is
on the estimate of book length in characters; nevertheless,
excluding these matters from consideration, we find
that the average book in the Fondren Index Sample is
341.5 × 2400 = 819,600 characters in size.
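
The chain of estimates for book size can be written out as a short calculation. This is a sketch of the arithmetic only, using the figures quoted in the text; the constant names are hypothetical.

```python
LINES_PER_PAGE = 38          # mean found by Dolby and Jones, Ref. (2)
CHARS_PER_LINE = 63          # mean for a 24 pica line
ROUNDED_CHARS_PER_PAGE = 2400
MEAN_PAGES_INDEXED = 341.5   # Fondren Index Sample

chars_per_page = LINES_PER_PAGE * CHARS_PER_LINE
chars_per_book = MEAN_PAGES_INDEXED * ROUNDED_CHARS_PER_PAGE

print(chars_per_page)        # 2394
print(round(chars_per_book)) # 819600
```

Rounding 2394 up to 2400 changes the per-book estimate by about a quarter of one percent, which is small next to the unmeasured effect of tables and figures.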

Turning now to collections of books, let us first consider
the university library. Here it is essential that the
notion "university" be specified in some way so as to
enable one to distinguish university libraries from
libraries of colleges in a manner consistent with that
used for other purposes by governmental agencies and
the educational institutions themselves. We implicitly
use the definition used by the Office of Education of the
Department of Health, Education and Welfare because we
use their statistical data book, Reference (3), as our
source of information about the holdings of college and
university libraries.

Unfortunately the data presented in Reference (3)
are incomplete; notable omissions are the University of
Chicago and Yale University. Although these omissions
undoubtedly will have some influence on the statistical
parameters of the distributions of interest to us, this
influence will most likely be quite minor and in no event
can it be expected to change the form of the distribution
or substantially affect its mean or variance.

There is one other defect of the data presented in
Reference (3) which is more critical for our concerns.
Most state university systems have had their statistics
amalgamated; thus it is impossible to determine (from
this source) the size of the library of the University
of California at Berkeley; only the total number of
volumes held in the entire California university system
is presented. This unfortunate state of affairs holds
for most of the other state systems also and tends
both to depress the number of distinct university libraries
and inflate the size of those that remain. Two factors
permit us to extract useful information from this
tabulation despite its amalgamated nature: first, it is
easy to obtain lists of all units belonging to a state
system (and also for the few private systems that operate
more than one campus) and thereby estimate the total
number of libraries whose structure must be studied.
Second, within state systems there is usually one 'giant'
library and a number of much smaller ones; this has
the consequence that the departure of the distribution
from lognormality, as is shown in Figure 2.7 which we
will shortly consider, is diminished when the separate
system units are accounted for, and, in view of the
smallness of the possible effect, it is not necessary
for us to study this difference in detail. Furthermore,
we can easily obtain the mean size from the revised
estimate of the number of libraries. By adjusting the
number of libraries represented in Reference (3)
through deletion of the special dental and medical
school branches and addition of all general campuses,
a total of 201 university libraries is attained.

The total number of volumes held in these institutions
is 152,230,163 (nearly one for every inhabitant of the
United States, and nearly as many as are held by all
public libraries), so the mean number of volumes per
university library is 757,364. The range in size may
appear remarkable to the reader, extending as it does from
some 100,000 volumes to more than 8 million. Figure 2.7
exhibits the size distribution, which, as we have by
now come to expect, is lognormal.

Knowing that the average book contains 819,600 characters
and assuming that the distribution of book size is
independent of the distribution of university library
size, we readily find that there are some 620,735,534,400
= $6.2 \times 10^{11}$, or approximately 620 billion,
characters stored in the average university library.

At this point we have established the mean size and
distribution of size for book based bibliographic
entities ranging in average size from about 30 characters
up to 620 billion characters, entities which differ
in size by a factor of 20 billion. Our immediate task
is to demonstrate that there is a simple and reasonable
model which encompasses the entire range of
bibliographic entities in a systematic way, relating those of
one size to those of another in a uniform and
unvarying manner.
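
The library-level figures above follow from three numbers quoted in the text. The sketch below simply repeats that arithmetic; the constant names are hypothetical, and the last line compares the largest and smallest units using the mean title length of 28.15 characters.

```python
TOTAL_VOLUMES = 152_230_163      # volumes held by the university libraries
UNIVERSITY_LIBRARIES = 201
CHARS_PER_BOOK = 819_600         # mean book size, Fondren Index Sample
MEAN_TITLE_CHARS = 28.15

mean_volumes = round(TOTAL_VOLUMES / UNIVERSITY_LIBRARIES)
chars_per_library = mean_volumes * CHARS_PER_BOOK
size_ratio = chars_per_library / MEAN_TITLE_CHARS

print(mean_volumes)              # 757364
print(chars_per_library)         # 620735534400
print(f"{size_ratio:.1e}")       # 2.2e+10
```

Against the rounded title size of 30 characters used in the text, the ratio is about 20 billion.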

In order to proceed, recall that the book title, table
of contents, index, and text are four bibliographic
units of increasing average size; let us say that
they belong to levels 1, 2, 3, 4 respectively. Let
$Y_n$ stand for the base 10 logarithm of the average
size of the units belonging to level n; Figure 2.8
displays the points whose coordinates are $(n, Y_n)$ for
n = 1, 2, 3, 4, and also the point $(8, Y_8)$ where $Y_8$ is the
base 10 logarithm of the mean size of a university library,
and the point $(7, Y_7)$ where $Y_7$ is the base 10 logarithm
of the mean size of a two-year college library, obtained
by analyzing the first 206 two-year college libraries
listed in Reference (3). This procedure is biased,
leading to a slightly high estimate of the mean size of
two-year college libraries, because the State of California
dominates the initial part of the list both in number
of two-year colleges and in the size of their libraries;
but analysis of the complete list in Reference (3),
which is presently underway, will undoubtedly lower the
mean size only insignificantly from the value 29,912
volumes used to determine the corresponding point in
Figure 2.8.

Figure 2.9 confirms that the size distribution of two 
year college libraries is lognormal and that the slope 
of the line representing the data on that graph is once 
again comparable with the slope of lognormal distri- 
butions presented in previous figures in this Chapter. 

Inspection of Figure 2.8 may lead the reader to wonder
whether levels 5 and 6 correspond to naturally occurring
collections of books; we think that level 5 corresponds
to general encyclopedias and level 6 to personal libraries,
but we have not ventured to include calculations based
on these hypotheses because of the difficulty of
amassing reliable and comprehensive statistical
information in their support.

The points in Figure 2.8 evidently lie very nearly on
a straight line. This means that the mean size, s(n),
of the bibliographic units comprising the n-th level
is related to n by an equation of the form

    $s(n) = a \cdot 10^{bn}$    (2.1)



where a and b are constants. It is natural to suppose
that a = 1 so that level 0 corresponds to the single
character; we will examine the data given in Figure 2.8
and Table 2.2 which corresponds to it to see if it is
consistent with this desirable and simplifying
hypothesis.

Table 2.2

SIZE IN CHARACTERS
OF VARIOUS BIBLIOGRAPHIC UNITS

Unit                        Level   Size               Log10 of Size

Title                         1     28.15                1.44948
Table of Contents             2     505                  2.70329
Index                         3     21,293               4.32710
Text of Book                  4     819,600              5.91360
Two Year College Library      7     24,528,169,200      10.38966
University Library            8     620,735,534,400     11.79291

By a standard application of the statistical F-test,
as described for instance in Ref. (4), it is easily
shown that the data do not contradict the hypothesis
that a = 1 in eq. (2.1) at the 5% significance level;
this means that the least squares best fitting line
for the points in Figure 2.8 does not differ
significantly from that line which is constrained to
pass through the origin of the coordinate system and
also minimizes the sum of the squares of the deviations
from the data points. This latter line corresponds to
a relation of the form

    $s(n) = 10^{bn}$    (2.2)



relating the mean size of bibliographic units to their
level. Carrying out the least squares minimization for
a function of this form on the logarithms of the data
leads to the line drawn in Figure 2.8 which corresponds
to the equation

    $s(n) = 10^{1.47247\,n} = (29.68)^n.$    (2.3)
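
The fitted exponent can be checked directly. For a line through the origin, Y = b n, the least squares slope is b = sum(n * Y_n) / sum(n^2). The sketch below applies this standard formula to the six logarithms of Table 2.2; the value used for the two-year college library is log10(24,528,169,200) = 10.38966. The names `LOG10_SIZE_BY_LEVEL` and `slope_through_origin` are hypothetical.

```python
# Level -> base 10 logarithm of mean size, from Table 2.2.
LOG10_SIZE_BY_LEVEL = {
    1: 1.44948,    # title
    2: 2.70329,    # table of contents
    3: 4.32710,    # index
    4: 5.91360,    # text of book
    7: 10.38966,   # two-year college library, log10(24,528,169,200)
    8: 11.79291,   # university library
}


def slope_through_origin(points):
    """Least squares slope of a line constrained to pass through the origin."""
    numerator = sum(n * y for n, y in points.items())
    denominator = sum(n * n for n in points)
    return numerator / denominator


b = slope_through_origin(LOG10_SIZE_BY_LEVEL)
K = 10 ** b
print(round(b, 5), round(K, 2))  # 1.47247 29.68
```

The result agrees with the exponent 1.47247 and the constant 29.68 of eq. (2.3).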



The constant 29.68 is an estimate of the fundamental
constant determining the level structure of the
bibliographic units considered above. More extensive data
will no doubt result in the modification of this value,
but it can be said with certainty that the fundamental
constant is approximately 30, and perhaps may be
identifiable with $(2e)^2 = 29.55\ldots$, where e = 2.718...
is the mathematical constant denoting the base of the
natural logarithm system.

This is our first main result:

The average sizes of the bibliographic units
title, table of contents, index, monograph, two
year college library, and university library are
powers of a fixed constant K whose value is
nearly $(2e)^2$.

If it could be shown that the mean size of an
encyclopedia is approximately $K^5$ and that of a personal
library (or perhaps a library reference sublibrary) is
about $K^6$, then it could be asserted that the natural
bibliographic units are equispaced when measured by the
logarithm of their size; the current state of knowledge
only permits us to assert that this is so for levels 1
through 4 and also for the separation of levels 7 and 8.

The previous argument suggests that the notion of level
be introduced more generally. Therefore define the
level of a given information base to be the integer
closest to the logarithm of its size (the latter measured
as usual in characters) to the base K; moreover, if
a system of level k provides access to an information
store of level n, then define the order of access
provided by the access system as (n - k). Thus an index
provides access of order 1 (= 4 - 3) to the monograph
it accompanies, and similarly the table of contents
and title provide access of order 2 and 3 respectively
to the book with which they are associated. We will
later find that a library card catalog provides access
of order 2 to the library archive but unfortunately it
occupies a physical volume which could provide order 1
access to the collection.
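
These two definitions translate directly into code. The sketch below assumes K = 29.68, the estimate obtained from the least squares fit, and uses the mean sizes quoted earlier in the chapter; the function names are hypothetical.

```python
import math

K = 29.68  # estimated fundamental constant (assumed value from the fit)


def level(size_in_chars, base=K):
    """Integer closest to the logarithm of the size to the base K."""
    return round(math.log(size_in_chars, base))


def order_of_access(access_system_size, store_size, base=K):
    """Level of the store minus the level of the system giving access to it."""
    return level(store_size, base) - level(access_system_size, base)


TITLE, CONTENTS, INDEX, BOOK = 28.15, 505, 21_293, 819_600

print([level(s) for s in (TITLE, CONTENTS, INDEX, BOOK)])  # [1, 2, 3, 4]
print(order_of_access(INDEX, BOOK))     # 1
print(order_of_access(CONTENTS, BOOK))  # 2
print(order_of_access(TITLE, BOOK))     # 3
```

The same functions place the mean two-year college library (24,528,169,200 characters) at level 7 and the mean university library (620,735,534,400 characters) at level 8.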

Thus far we have principally concerned ourselves with
the mean value of the various size distributions that
have been examined, and have thereby shown that there
is a simple and uniform relationship which connects
the smallest of the natural units to the largest. We
must now take up the question of the extent to which the
mean characterizes the distributions that occur.
The figures displaying the various distributions at
the same time provide powerful evidence that all of the
distributions are lognormal. The elementary form of
the lognormal function, which is what occurs here,
depends on two parameters, the lognormal mean and the
lognormal standard deviation; if these parameters are
known, then the usual mean value of the distribution
can be determined and conversely, if the lognormal
standard deviation and the usual mean are known, the
lognormal mean and hence the lognormal function itself
are completely determined (cf. Chapter III). From this
it follows that if the lognormal standard deviations of
the various distributions of interest are all essentially
equal, then the associated lognormal functions are
in reality determined by the mean value, that is, by
the level, of the distribution. We shall show that this
is indeed the case. Table 2.3 lists the lognormal
standard deviation of the six distributions that have
been described thus far.



Table 2.3

LOGNORMAL STANDARD DEVIATION

Unit                        Level   Lognormal S.D.

Title                         1         0.19
Table of Contents             2         0.30
Index                         3         0.44
Monograph                     4         0.23
Two Year College Library      7         0.29
University Library            8         0.36



There is evidently not much variation of the lognormal
standard deviation as the level changes from a
distribution whose typical size is about 30 characters to one
whose typical size is about 600 billion characters, and
in particular what variation there is does not seem to
have a trend. Based on the data contained in Table 2.3
we assert that the lognormal standard deviation is
essentially constant throughout the entire range of
bibliographic interest, and consequently the distributions
of size of the various bibliographic units are determined
by the level of the unit.







The lognormal standard deviation corresponds to the
slope of the line defining the lognormal function for
figures drawn on lognormal probability graph paper, as
Figures 2.1-2.7 and 2.9 are. The underlined statement
in the previous paragraph is the analytical version
of the geometrical assertion that the lines representing
all of the distributions are nearly parallel. We show
to what extent this is so in Figure 2.10, which displays
the distributions for all six levels; the variation of
slope is indeed not great. The mean value of the standard
deviations listed in Table 2.3 is 0.30, which may
be conveniently adopted as an estimate of the
level-independent lognormal standard deviation.
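
The averaging step can be checked directly. The following is a
minimal sketch, assuming only the six values listed in Table 2.3;
the names LOGNORMAL_SD, mean_sd and spread are ours and do not
appear in the study.

```python
# Lognormal standard deviations as listed in Table 2.3, keyed by unit.
LOGNORMAL_SD = {
    "Title": 0.19,
    "Table of Contents": 0.30,
    "Index": 0.44,
    "Monograph": 0.23,
    "Two Year College Library": 0.29,
    "University Library": 0.36,
}

values = list(LOGNORMAL_SD.values())

# Unweighted mean of the six tabulated values.
mean_sd = sum(values) / len(values)

# Largest minus smallest tabulated value, a crude measure of variation.
spread = max(values) - min(values)

print(f"mean = {mean_sd:.2f}, spread = {spread:.2f}")
```

The unweighted mean rounds to the 0.30 adopted above; the sketch
makes no attempt to weight the levels by sample size, which the
text does not report here.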

The assertion that the distribution of a variable x is
lognormal is equivalent to stating that the distribution
of log x is the normal (Gaussian) distribution. Here
'log' denotes the logarithm with respect to any
conveniently chosen base. The graph of a normal distribution
is the well known 'bell-shaped curve'. The level-structured
lognormal distribution model of access systems described
above can be equivalently viewed as a level-structured
model for the logarithm of the size of bibliographic
units such that the means of the logarithms of the various
levels are equally spaced and the associated distributions
are normal, as shown in Figure 2.11 for levels 1-3.
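
The conversion between the usual mean and the lognormal mean
mentioned earlier in this chapter follows from this equivalence.
The sketch below is illustrative only and rests on two assumptions
of ours: that 'lognormal mean' and 'lognormal standard deviation'
denote the mean and standard deviation of ln x, and that natural
logarithms are used (the base underlying Table 2.3 is not stated
in this section). The function names are hypothetical.

```python
import math


def usual_mean(log_mean, log_sd):
    """Arithmetic mean of x when ln x is normal with the given mean and s.d."""
    return math.exp(log_mean + 0.5 * log_sd ** 2)


def log_mean_from_usual(mean, log_sd):
    """Inverse of usual_mean: recover the mean of ln x from the arithmetic mean."""
    return math.log(mean) - 0.5 * log_sd ** 2


# Round trip: with the s.d. fixed, the usual mean determines the lognormal mean.
mu = log_mean_from_usual(29.54, 0.30)
print(mu, usual_mean(mu, 0.30))
```

With the standard deviation held fixed, the two functions are
inverses of one another, which is the sense in which the mean
alone determines the distribution.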

From that figure one also sees that the several bell
curves have little overlap; this corresponds to the
relative horizontality of the lines in the previous
Figure 2.10, which is another way of stating that the
lognormal standard deviation is a small number. The
converse possibility, which fortunately does not occur,
is that the lognormal standard deviation be relatively
large, with the consequence that normal distributions
like those illustrated in Figure 2.11 would possess
a large degree of overlap, with the overall appearance
of gentle waves uniformly spread over a sea rather than the
sharply defined and separated peaks and valleys that
Figure 2.11 so clearly exhibits. What this means is that
the notion of level for bibliographic units makes
sense; almost all units of some given type are of a
size that is closer to the level of that type than to
any other level. For instance, from Figure 2.10 we can
read that fewer than 0.05% (sic!) of the Tables of
Contents are so large as to lie (in logarithmic measure)
closer to level 3 (Indexes) than to level 2 (Tables
of Contents); similarly, fewer than 0.2% of the Two
Year College Libraries are so large that they lie closer
(in logarithmic measure) to the average size of a
university library than to the average size of a
two-year college library.



Figure 2.11. Some Access Distributions in Logarithmic Variables



These observations suggest that the notion of boundary
separating two adjacent levels should be introduced as
that size corresponding to half integer values of the
level. More precisely, with level n and size s(n)
related as in eq. (2.2), we say that the size s(n+1/2)
is the boundary size between s(n) and s(n+1), and that
(n+1/2) is the boundary between level n and level (n+1).

With this notion in hand it becomes possible to analyze
a bibliographic item in order to determine if its size
coincides reasonably with its 'proper' size, i.e.,
with the level of that type of bibliographic unit:
from its size s compute log_K s and compare this number
with the appropriate bibliographic unit level n to see
whether log_K s lies within ± 1/2 of n; if it does not,
then we may assert that the item of size s is either too
large or too small. There will of course be specific
exceptional instances for which the size of the unit is
indeed 'proper' although not consistent with the
statistically typical behavior for items of its bibliographic
type, but the designer or evaluator of information
access systems and/or information bearing data bases
should, we think, warily approach the question of the
size of a system from this point of view.
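
The test just described can be sketched as follows. This is a
minimal illustration, assuming that eq. (2.2) has the form
s(n) = K^n with K = 29.54 (the level 1 mean quoted in this
chapter), so that the level of an item is its size's logarithm to
base K. The function names level_of and size_check are ours.

```python
import math

K = 29.54  # level 1 mean size in characters, as quoted in this chapter


def level_of(size, k=K):
    """Logarithm of the size to base k, i.e. the (fractional) level."""
    return math.log(size) / math.log(k)


def size_check(size, nominal_level, k=K):
    """Compare an item's size with the level expected for its type."""
    deviation = level_of(size, k) - nominal_level
    if deviation > 0.5:
        return "too large"
    if deviation < -0.5:
        return "too small"
    return "proper"


# A 33,120-character paper tested against level 3.
print(round(level_of(33120), 2), size_check(33120, 3))
```

Whether a size lying exactly on a boundary counts as 'proper' is a
choice the text leaves open; the sketch treats the boundary itself
as acceptable.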

The access model presented in this chapter is not
restricted to the book and its subsystems and
supersystems. There is considerable evidence that it
reflects universal properties of information stored in
written English form, and, in a slightly generalized
version, may be still more broadly applicable to the
analysis and modeling of other types of information
systems such as those associated with the modalities
of sensory perception. These wide ranging and difficult
issues cannot be examined here in a serious way;
moreover, we do not yet have sufficient data upon which a
definitive report can be based. Some of the intriguing
vignettes that are most directly related to information
presented in forms analogous to, if superficially
distinct from, the book information system hierarchy
explored above may nevertheless prove helpful for
the reader.

First consider the size relationships of component
units of the serial publication archive. We have
studied the mathematics journal subarchive with the
following results. For 7445 papers reviewed in volume
36 of Mathematical Reviews (published in 1968), the mean
length of an abstracted paper is 13.8 'pages'; here
'page' refers to the myriad distinct page sizes and
formats used by the 800-odd distinct journals reviewed
by Mathematical Reviews. Bearing this in mind, and
noting that we have not attempted to directly determine
the mean number of characters per page of mathematics
text nor the effect of the numerous special symbols
which extend the normal type font, use of our previous
estimate of 2400 characters per page of text yields
the estimate of 33,120 characters per mathematics paper;
hence such a paper is of level 3. The mean length
of an abstract in Mathematical Reviews is easily
estimated to be about 1081 characters. Therefore the
size of the average mathematics paper is 30.6 times the
size of the average abstract. Division of the
estimated size of an abstract by K = 29.54 gives 36.59
characters, which is about the size of the average
mathematics journal paper title and is of course quite
close to the level 1 mean of 29.54 characters. We
conclude that journal papers in mathematics are
structured in a manner which is consistent with the
general model proposed for books.
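
The arithmetic of this example can be retraced in a few lines.
The sketch below uses only the figures quoted in the paragraph
above; the variable names are ours.

```python
MEAN_PAGES = 13.8        # mean length of a reviewed paper, in 'pages'
CHARS_PER_PAGE = 2400    # the chapter's estimate of characters per page
ABSTRACT_CHARS = 1081    # estimated mean length of an abstract
K = 29.54                # level 1 mean, in characters

# Estimated characters per mathematics paper.
paper_chars = MEAN_PAGES * CHARS_PER_PAGE

# Ratio of paper size to abstract size, to be compared with K.
paper_to_abstract = paper_chars / ABSTRACT_CHARS

# Size one level below the abstract, to be compared with title length.
title_estimate = ABSTRACT_CHARS / K

print(round(paper_chars), round(paper_to_abstract, 1), round(title_estimate, 2))
```

Both ratios fall close to K, which is the point of the example:
paper, abstract and title occupy three adjacent levels.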

Next consider a more complex example which refers
directly to the access problem. It is usual to find
so-called "subject headings" at the foot of library
catalog cards which are intended to provide cross
reference access to subject areas other than those
associated with the class number of the item corresponding
to the catalog card. There are nearly 93,000 subject
headings in the Library of Congress Subject Headings,
seventh edition (1966). A uniform 1/66 sample drawn
from an alphabetized list of these headings shows that
the mean number of characters per subject heading is
22.3, which is not remarkably close to K = 29.54.

However, the distribution of subject headings per catalog
card as determined from an analysis of the Fondren
Sample has a mean of 1.2 headings per card; if the
distribution of subject headings per card is independent
of the distribution of characters per subject heading,
then the mean number of subject heading characters per
catalog card, including the associated ordinals and
interword space characters, will be the product of the
means of the component distributions, which is 29.16.
Hence the collection of subject headings per card
provides about the same level of discrimination above
the one-letter Library of Congress class in the mean
that is provided by the title. Considering the
distributions of characters per subject heading and subject
headings per card leads to the lognormal functions shown
in Figure 2.12; we conclude that the subject heading
access mechanism is consistent with the level structured
model and it belongs to level 1.
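
The quoted product of 29.16 is not simply 22.3 x 1.2, which is
26.76. It is reproduced exactly if each heading is taken to carry
two additional characters, one ordinal and one space. That
allowance is our inference from the phrase 'including the
associated ordinals and interword space characters' and is not a
figure stated in the text; the sketch below labels it as such.

```python
MEAN_CHARS_PER_HEADING = 22.3   # from the 1/66 sample of subject headings
MEAN_HEADINGS_PER_CARD = 1.2    # from the Fondren Sample

# ASSUMPTION (ours): one ordinal plus one space character per heading.
EXTRA_CHARS_PER_HEADING = 2

# Product of the bare means, without any allowance for extra characters.
bare_product = MEAN_CHARS_PER_HEADING * MEAN_HEADINGS_PER_CARD

# Product once the assumed ordinal and space are counted with each heading.
chars_per_card = (
    MEAN_CHARS_PER_HEADING + EXTRA_CHARS_PER_HEADING
) * MEAN_HEADINGS_PER_CARD

print(round(bare_product, 2), round(chars_per_card, 2))
```

Multiplying the means is legitimate only under the independence
condition stated in the paragraph above.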







The phenomenon that the mean values of the sizes of adjacent
access levels are in the ratio of about 30 to 1 is
not confined to access systems associated with written
natural language archives. Consider ALTEXT, a contemporary
text-processing higher level (macro expander) computer
language [5]. Such a language consists of computer
instructions which have two parts: a generic instruction
such as the GOTO of FORTRAN which specifies the general
function of the instruction, and certain other more
particular components which contain the details of data
location and transfers of control. The implementation
of a higher level computer language instruction consists
of a sequence of one or more "machine language" or
"assembly language" instructions; the advantage of the
higher level language is that it frees the programmer
from the burden of keeping track of numerous housekeeping
details concerning the location and manipulation of the
data at the cost of lower (local) efficiencies of execution.
This is another way of stating that the higher level
language instructions act as an access system for the
sequences of assembly language instructions that are
their implementation.

With this preamble in mind, one can examine the number 
of assembly language instructions required to implement 
each of the distinct generic higher level language 
instructions. For the generic instructions of ALTEXT, 
the mean number of assembly language instructions per 
ALTEXT "macro" is 30.? 2 (including implementation of the 
"ALTEXT macro" which provides the interface with the 
operating system of the implementing computer) for
implementation on the IBM 360/30 computer. Figure 2.13
confirms in a rather startling way that the distribution
of implementation size is lognormal; hence we conjecture
that the level structured access model will probably
find significant application in the design of computer
languages.

That the structure of many types of linguistic units
is lognormal has long been known and abundantly verified.
The lognormality of word length statistics was discovered
at least as early as 1887 by Mendenhall [6] and was
subsequently studied, along with sentence length
distributions, inter alia, by Yule [7], Williams [8], and
Herdan [9]. Yule computed the sentence length
distributions for a number of samples of written English and
although he did not notice their lognormality himself,
Williams did test this hypothesis on Yule's data and on
more he gathered himself. More extensive data has been
collected by Kucera and Francis [10] but care must be
exercised to ensure that it is partitioned into
homogeneous subject and/or author classes before attempting
to study the lognormality of the statistics; the problem
of describing the structure of inhomogeneous data, which
amounts to studying how distinct lognormal distributions
combine, is relatively complex. Moreover, much of the
Kucera and Francis data refers to printed materials that
are unlikely to form an active part of an archival
library collection; it is heavily weighted with fiction
and press coverage.

Herdan [9] analyzed 80,000 words of telephone
conversations collected by French, Carter and Koenig of the
Bell Telephone Laboratories and concluded that (phonetic)
word length is lognormally distributed. An indication
that the parameters of these linguistic distributions
are relatively insensitive to variations in language
vocabulary and to whether the written or spoken form is
used is provided by Figure 2.14, which shows nearly
parallel lines representing the Herdan telephone
conversations and Mendenhall's analysis of 1000 words
from Shakespeare's works (as represented by Williams).

These examples and others too numerous to report
here prompt us to speculate that the occurrence of the
lognormal distribution is fundamental to all human
information processing activities. In this regard we
distinguish two types of activities: those that process
direct sensory impressions that are received through the
sensory organs, and those that process coded information
such as is represented by linguistic codes. In the
latter instance the directly perceived data arrives
via the sensory organs but the essential content is
unrelated to the particular code used for its transmission.
Although there may be important differences between the
internal mechanisms that process these two types of
information, there are at least two characteristics that
the two types of input information share: the quantity
of information that passes through the processing system
is very large and the system must be capable of responding
to inputs whose sizes vary greatly. The first condition
requires that the information processing system be
able to compress (with information loss) the vast amount
of data passing through it so as to be enabled to retain
for future use a much smaller but characteristic subset
of it; in other words, the processing system must
function as an access system to the information passing
through it. The second condition suggests that some
functional transformation must be applied to the input
sensory information in order to reduce its extended
range to a smaller one more conveniently handled by
the neural network; for example, there has long been
evidence (which is reflected by the 'decibel' scale
of measurement) that the subjective response to the
stimulus provided the ear by acoustic energy varies
as the logarithm of the input energy.




Figure 2.14. Word Length Distributions (in characters)



Generally, there are three reasons for making a scale
transformation in analyzing data (e.g., see Tukey [11]):

1. To linearize the relation between two variables.

2. To normalize the underlying probability
distribution.

3. To stabilize the variance.

Although in most applications any one of these results
would provide sufficient reason for introducing a
particular transformation, it is not uncommon to encounter
situations where the transformation is originally
introduced for one reason and subsequent analysis shows
one or both of the remaining desiderata have also been
achieved.

In this context it is illuminating to study the work
of the nineteenth century experimental psychologist
G. Fechner [12]. He made the important observation
that the smallest change in a stimulus to which the human
can respond is proportional to the mean level of the stimulus.
That is, if an individual can just sense a difference
of, say, one unit when the mean level of stimulation is
10 units, then he will also just be able to detect a
difference of 2 units when the mean level is 20 units.
This multiplicative property of the just noticeable
difference led him to introduce the logarithm function
in order to stabilize the variance, i.e., make it
constant throughout the range of perception. He then
conjectured that the function relating subjective
response to the transformed variable, the logarithm of
the stimulus, is a linear function, thus arriving at
the celebrated (and once again hotly debated) 'Law'
of Weber and Fechner. The reader will observe that the
logarithm of the size of bibliographic units stabilizes
the variance of the distributions of these units
throughout the entire range of 'bibliographic perception'.

This certainly makes it tempting to inquire whether the
Weber-Fechner 'Law' might not be merely an approximation
to some more accurate description of the underlying
functional transformation governing sensory perception.
This question has received considerable attention in
recent years and notable contributions have been made,
principally by Stevens (e.g., [13]), who has generalized
the logarithmic Weber-Fechner transformation so that
response is some power of stimulus; that this change
actually constitutes a generalization becomes clear when
it is noted that the integral of 1/x is log x whereas
the integral of any other power of x is again a power
of x; in this sense the logarithm is the limit of power
functions (see Dolby [14]). The relationship between
linguistic and hence bibliographic units and these
psychophysical questions has been remarked by several
workers, most notably perhaps by Fairthorne [15];
Zipf's 'Law' [16] in its integrated form is just the
Weber-Fechner logarithmic relation, and Mandelbrot's
[17] generalization of Zipf's function corresponds to
(indeed, it is identical to) Stevens's power function.
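
The sense in which the logarithm is a limit of power functions
can be illustrated numerically. The sketch below assumes the
normalization (x^lam - 1)/lam, which is the integral of t^(lam-1)
from 1 to x; that normalization, and the name power_transform,
are our choices for illustration and are not taken from the
references cited.

```python
import math


def power_transform(x, lam):
    """Integral of t**(lam - 1) from 1 to x; equals ln x when lam is 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# As lam shrinks toward 0 the power function approaches the logarithm.
for lam in (1.0, 0.1, 0.01, 0.001):
    print(lam, power_transform(5.0, lam), math.log(5.0))
```

For every nonzero lam the transform is a shifted and scaled power
of x, yet its values approach ln x as lam tends to zero.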

These questions will be taken up from a more mathematical 
standpoint in the next chapter with the intent of 
showing how they can be derived, following an argument 
essentially due to Mandelbrot, from elementary 
considerations from information theory, and, of 
more importance for our purposes, that a slight exten- 
sion of this argument generalizes the Weber-Fechner- 
Zipf-Stevens-Mandelbrot functions to the lognormal 
distribution. For as the extensive bibliographic 
data assembled in the earlier parts of this chapter 
show, it is the lognormal function that in fact describes 
reality. 



References

1. Dolby, J. L., V. Forsyth, and H. L. Resnikoff,
   Computerized Library Catalogs: Their Growth, Cost
   and Utility, M.I.T. Press, Cambridge, 1969.

2. Dolby, J. L. and W. J. Jones, "The Measurement of
   Composition Practice", in Advances in Computer
   Typesetting, Institute of Printing, London, 1966.

3. Price, Bronson, Library Statistics of Colleges and
   Universities, Fall 1969, Data for Individual
   Institutions, U. S. Office of Education, Washington,
   D. C., 1970.

4. Youden, W. J., Statistical Methods for Chemists, John
   Wiley & Sons, New York, 1951.

5. Dolby, J. L., W. E. Houchin, Roger Stark, and H. L.
   Resnikoff, Non-Numeric Programming Language Studies;
   ALTEXT II, Final Report to U.S.A.F. Office of
   Scientific Research, R & D Consultants Co.,
   Los Altos, California, 1970.

6. Mendenhall, T. C., "The Characteristic Curves of
   Composition", Science, 9 (214, supplement) (1887), 237-49.

7. Yule, G. U., "On Sentence-Length as a Statistical
   Characteristic of Style in Prose", Biometrika, 30(1939),
   363-84.

8. Williams, C. B., "A Note on the Statistical Analysis
   of Sentence-Length as a Criterion of Literary
   Style", Biometrika, 31(1940), 356-61.

9. Herdan, G., "The Relation between the Dictionary
   Distribution and the Occurrence Distribution
   of Word Length and its Importance for the Study
   of Quantitative Linguistics", Biometrika, 45(1958),
   222-8.

10. Kucera, Henry and W. N. Francis, Computational
    Analysis of Present-Day American English, Brown
    University Press, Providence, 1967.

11. Tukey, J. W., "On the Comparative Anatomy of
    Transformations", Annals of Mathematical Statistics,
    28(1957), 602-32.

12. Fechner, G. T., Elemente der Psychophysik, 1860.

13. Stevens, S. S., "Neural Events and the Psychophysical
    Law", Science, 170(1970), 1043-50.

14. Dolby, J. L., "A Quick Method for Choosing a
    Transformation", Technometrics, 5(1963), 317-25.

15. Fairthorne, R. A., "Empirical Hyperbolic Distributions
    (Bradford-Zipf-Mandelbrot) for Bibliometric
    Description and Prediction", Journal of
    Documentation, 25(1969), 319-43.

16. Zipf, G. K., Psycho-Biology of Language, Houghton
    Mifflin, 1935.

17. Mandelbrot, B., "An Information Theory of the
    Statistical Structure of Language", Proceedings
    of the Symposium on Applications of Communication
    Theory, London, September 1952, Butterworth, 1953,
    486-500.




CHAPTER III

MATHEMATICS OF
INFORMATION DISTRIBUTIONS



This chapter is devoted to the mathematical study of
some of the distributions that arise naturally in the
study of information systems. It will necessarily
be more demanding of the reader's mathematical knowledge
than the remainder of the book and has therefore been
written so as to permit the reader to pass immediately
to Chapter IV without loss of continuity. We believe,
however, that the significance and implications of the
level structured model of access systems presented
in Chapter II cannot be fully understood unless the
relationship of that model to other competing models,
extant and potential, is made clear. Moreover, the
most powerful theoretical arguments for the appearance
of the lognormal distribution in the model structure
come from information theory and its mathematical
apparatus, so there is really no way to avoid these
technical considerations.

We will be principally concerned with two distributions,
the Zipf and the lognormal. The former is also frequently
associated with the names Estoup (1), Bradford (2),
and more recently Mandelbrot (3,4). Zipf rediscovered
and popularized the observation that the ranked frequency
distribution of words in natural text corpora is essentially
of the form

          y = c x^{-s}                              (3.1)

where x denotes the rank and y(x) the frequency of
occurrence of the word of rank x; here c and s are
constants selected to fit the data as nearly as possible
and which therefore are characteristic of the text
corpus and to some extent the language from which it
is drawn; Zipf only considered the case s = 1.

Figure 3.1, taken from Zipf (5), exhibits such distributions.

Distributions of the type (3.1) occur in other fields, 
associated, for special values of s, with Pareto (6) 
in economics, Lotka (7) in what might be termed 
'sociological mathematics', and more recently De Solla 
Price (8) , and no doubt in numerous other contexts 
as well. 

That the Zipf "law" is taken seriously, not just considered
as an accidental quirk of the data to be remarked upon
and ignored, is attested by the variety of publications
that dispute, modify, and reduce it to a triviality.
Mandelbrot showed that a slight generalization of the
Zipf law, in better agreement with the data, is a
consequence of elementary arguments and reasonable hypotheses
about the effort required for the efficient transmission
of information; in this sense he is a bulwark for both
the proselytizers and trivializers of Zipf since
his arguments convincingly show that the nature of the
distribution has nothing to do with special properties
of language that distinguish it from a variety of other
processes that extremize some function representing
the degree of organization of the process in a
statistical sense. Mandelbrot was quick to point out the
connection, which is more than merely formal, between
his result and the mathematical methods used to derive
it, and the derivation of the partition function in
statistical mechanics; cp. Schrödinger (9).

Figure 3.1 (rank-frequency distributions on logarithmic
scales, including Plautine Latin words)

Despite the considerable research efforts that have
gone into understanding and improving the relation
of Zipf, there are significant discrepancies between
the Zipf-Mandelbrot predictions and the observed data
for large samples of words drawn from natural language
text corpora and for other data collections as well.
There are theoretical difficulties too: Yule (10)
observed that the sum of the frequencies predicted
by the Zipf-Mandelbrot distribution (with s = 1) is not
finite, which implies that there must be a significant
deviation from this distribution for large values of
the variable. This kind of difficulty is not as easily
brushed aside as disagreement with the data can be,
for it entails an unknown mechanism which determines
that range of the variable for which the distribution
must be modified as well as the unknown modification
itself, and leaves the researcher bereft of the
argument that improved "experimental measurements" will
modify the situation in any agreeable way. It is much
easier to reconcile ill fitted observations, and their
consequences are normally much more local in nature.
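
Yule's objection can be seen numerically. The sketch below is
ours: it sums x^(-s) over the first n ranks (taking c = 1 and
a = 0 for simplicity, an assumption made only to keep the example
short) and shows that for s = 1 the partial sums keep growing by
about ln 10 for every tenfold increase in n, whereas for s = 2
they settle below a finite bound.

```python
import math


def zipf_partial_sum(n_terms, s=1.0):
    """Sum of x**(-s) for ranks x = 1, ..., n_terms."""
    return sum(x ** -s for x in range(1, n_terms + 1))


# With s = 1 each extra decade of ranks adds roughly ln 10 to the total.
growth_per_decade = zipf_partial_sum(10 ** 5) - zipf_partial_sum(10 ** 4)

# With s = 2 the partial sums stay below the limit pi**2 / 6.
bounded_sum = zipf_partial_sum(10 ** 5, s=2.0)

print(growth_per_decade, math.log(10), bounded_sum, math.pi ** 2 / 6)
```

Because the s = 1 total has no finite limit, the predicted
frequencies cannot be normalized to probabilities over an
unbounded range of ranks, which is the difficulty described above.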

Nevertheless it has been found desirable to modify
Zipf's Law in many ways to better fit the data.
Mandelbrot's modification, based on his theoretical
considerations, is

          y = c (x - a)^{-s}                        (3.2)

where a is a small constant. Belonogov (11) found
that the distribution

-c (x-1)^ -cx^ 
y = e -e 





(3.3) 



describes the rank-frequency structure of printed 
commercial Russian. Good (12) is led to 



y 



c (x-a) 



-s (1 



by"^) 



(3.4) 



with b a small constant; this form has (3.2) as a first
approximation (because b is small) and also is
responsive to Yule's criticism since the sum of the frequencies
is finite. It is derived by including in the effort
function (see below) a factor corresponding to the
effort required to incorporate words of large rank in
the inventory, and represents, in a certain sense,
part of the system 'overhead'. Unfortunately, (3.4)
is a complicated expression and Good's choice of
overhead factor is in no way uniquely determined.

Other authors have turned to functions that are
apparently quite different in order to more faithfully
describe their data. Houston and Wall (13) described
the distribution of term usage in manipulative indexes
using the lognormal distribution, eliciting from
Fairthorne (14) the remark that, in his view, Wall (15)
(and presumably also Houston and Wall (13)) selected
the lognormal only because the data was well fit by
that distribution in the sense that the results plotted
as a straight line on lognormal probability graph
paper, but that they would also have done so on
ordinary logarithmic graph paper because "segments
of the tail of a Gaussian distribution are not readily
distinguished from segments of a hyperbolic
distribution"; by the latter he means the Zipf distribution.
Certainly this remark applies in principle to the figures
plotted on lognormal graph paper in the previous Chapter,
but we will see that it is significant only when the
variance of the lognormal distribution in question
is large.

Carroll (16) has discussed the statistical problems
associated with representation of the Standard Sample
of Present-Day Edited American English (17) by lognormal
distributions; there is in his work no hint of the
formerly used Zipf approximation.

We think that there are two conclusions that should
be drawn from this necessarily brief survey: first,
the Zipf-Mandelbrot distribution does not adequately
fit much data although it is well grounded in theory,
and second, it is often difficult to distinguish
lognormal approximations of data from Zipf approximations.
They suggest that an intimate relation may exist
connecting the Zipf and lognormal distributions, and, if this
be true, that a 'derivation' of the lognormal from
elementary principles along the lines of Mandelbrot's
arguments may be possible.

In order to show that these hopes are indeed justified,
we will present a derivation of the Zipf-Mandelbrot
Law following the usual argument, and in effect
following Schrödinger (9), although the notation and
terminology there are of course quite different.

Consider information 'states' S_1, S_2, ..., S_x, ...
constituting some inventory, such as the words of a
language as they occur in some large text corpus, or
a large random collection of monograph titles or
indexes, ordered in some convenient fashion. Let
e(x) denote the 'effort' ('energy') required to utilize
state S_x in an access system ('communication system'),
and denote by p(x) the probability of utilization
of S_x in the inventory. Following Shannon (18),
the expected amount of information per unit expected
effort is proportional to

     I = - \sum_x p(x) \log p(x) \Big/ \sum_x p(x) e(x).      (3.5)



If the access system is such that the expected amount
of information per unit effort is maximized, then the
probabilities p(x) cannot be unrelated to the effort
function e(x); maximization of I subject to the
necessary restraint

     \sum_x p(x) = 1                                          (3.6)

will determine the form of p(x) for given e(x). Now
maximization of (3.5) subject to (3.6) is equivalent
to maximization of

     - \sum_x p(x) \log p(x)                                  (3.7)

subject to (3.6) and the additional restraint that the
total effort

     \sum_x p(x) e(x)

is constant as well. The mathematical method of
Lagrange multipliers provides the solution to this
extremal problem in the following way: since the total
probability (3.6) and the total effort are constant,
the function (3.7) and the function

     H = - \sum_x p(x) \log p(x) + (1 + a_0) \sum_x p(x)
           + a_1 \sum_x p(x) e(x)                             (3.8)

attain their maximum for the same functions p(x)
of e(x), where a_0 and a_1 are arbitrary constants and
the form (1 + a_0) has been chosen for later notational
convenience. Now subject each p(x) to small
differentiable independent functional variations, all the
while keeping x fixed; H will assume its maximum
where the derivatives \partial H / \partial p(x) all vanish.
This yields the simultaneous conditions

     0 = \partial H / \partial p(x)
       = - (1 + \log p(x)) + (1 + a_0) + a_1 e(x)             (3.9)

which implies

     \log p(x) = a_0 + a_1 e(x).                              (3.10)

This is the fundamental relation connecting the effort
function, which is presumed to be known, with the
probability of occurrence of the state $S_x$. It remains
to specify the effort function. Mandelbrot argued
that, if the states $S_x$ are words drawn from natural
text and arranged in decreasing frequency of occurrence,
then e(x) is proportional to log(x-a) with a some
small constant. This hypothesis immediately leads to
(3.2) with $c = e^{a_0}$ and $s = -a_1$. In order that the
distribution decrease with increasing x, $a_1$ must be
negative. (If $a_1 = 0$, the distribution degenerates
into the uniform distribution, which can only apply
to a finite range of the variable x.)
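The step from (3.10) to the power-law form can be checked numerically. The following is only a minimal sketch and is not part of the original report: the constants `a0`, `a1` and the offset `a` are arbitrary illustrative values, the proportionality constant in e(x) is taken to be 1, and the probabilities are left unnormalized.

```python
import math

# Illustrative constants (assumptions, not values from the report).
a0 = -1.0    # additive constant in (3.10)
a1 = -1.2    # must be negative for a decreasing distribution
a = -0.5     # small offset, chosen so that x - a > 0 for ranks x >= 1

def effort(x):
    """Mandelbrot's effort function e(x) = log(x - a)."""
    return math.log(x - a)

def p_from_effort(x):
    """Solve (3.10), log p(x) = a0 + a1 * e(x), for p(x)."""
    return math.exp(a0 + a1 * effort(x))

def p_power_law(x):
    """Power-law form c * (x - a)**(-s) with c = e**a0 and s = -a1."""
    c = math.exp(a0)
    s = -a1
    return c * (x - a) ** (-s)

ranks = [1, 2, 5, 10, 100, 1000]
# Largest disagreement between the two forms over the sample ranks.
max_gap = max(abs(p_from_effort(x) - p_power_law(x)) for x in ranks)
```

Under these assumptions the two expressions agree to rounding error, which is all that the identification of c and s asserts.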

The idea underlying Mandelbrot's choice of effort
function is perhaps most simply illustrated by recalling
the ordinary use of positional notation to represent
positive integers. If b is an integer greater than 1
and n is any positive integer, then n has a unique
representation of the form

$$ n = a_N b^N + a_{N-1} b^{N-1} + \cdots + a_k b^k + \cdots + a_1 b + a_0 $$

with the $a_k$ integers less than b but not negative. For
instance, if b = 10 and n = 234, then

$$ 234 = 2 \cdot 10^2 + 3 \cdot 10 + 4. $$

By means of such an expression n can be identified with
the sequence of numbers $a_N a_{N-1} \ldots a_0$. If b = 10
this correspondence is the usual decimal expression
for n, while if b = 2, it is the binary expression.

Such an expression for n requires N + 1 symbols each of
which is selected from an inventory of b symbols
(0, 1, 2, ..., b-1). Evidently

$$ N + 1 > \log_b n \ge N, $$

so the number of places required to express n in base
b is approximately $\log_b n$, and approximately $b \log_b n$
selections suffice to specify an integer lying between
0 and n.

By coding the information specified by the states $S_x$
as integers, this argument can be made to apply to the $S_x$
themselves, leading to the Zipf-Mandelbrot distribution
for words if that is what the states represent.
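The relation between the number of places and the logarithm can be illustrated with a short sketch. It is offered as an illustration only; the function names `digits` and `places` are introduced here and do not appear in the report.

```python
import math

def digits(n, b):
    """Return the base-b digits of a positive integer n, most significant first."""
    if n < 1 or b < 2:
        raise ValueError("need n >= 1 and b >= 2")
    out = []
    while n > 0:
        n, r = divmod(n, b)   # peel off the least significant digit
        out.append(r)
    return out[::-1]

def places(n, b):
    """Number of places needed to write n in base b."""
    return len(digits(n, b))

example = digits(234, 10)     # [2, 3, 4]
binary = digits(234, 2)       # [1, 1, 1, 0, 1, 0, 1, 0]

# For each (n, b) record the place count next to log_b(n); the sample
# values of n are deliberately not exact powers of b.
checks = [(n, b, places(n, b), math.log(n) / math.log(b))
          for n in (7, 234, 10**6 + 3) for b in (2, 10)]
```

In every row of `checks` the place count exceeds the logarithm by no more than one, which is the sense in which the effort of specifying n grows like log n.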

It must be recognized that more is involved in the
distribution of information states than the simple
matter of minimal coding; Good's argument mentioned
above attempts to account to some extent for the
effort required to add a state to the inventory,
i.e., to learn a rare word. Therefore the effort
function may not have the form proposed by Mandelbrot
except in the simplest of cases, and it becomes necessary
to investigate the probable nature of substitutes
for it. It might be argued that in general a multiple
of log(x-a) will constitute a good first approximation
to e(x). This, and equation (3.10), suggest the
introduction of the variables

$$ u = \log(x-a) \qquad (3.11) $$

and

$$ f(u) = \log p(x), \qquad (3.12) $$

so that (3.10) becomes

$$ f(u) = a_0 + a_1 e(e^u + a) = e^*(u), \qquad (3.13) $$

defining the function e*(u) which is more convenient to
work with.








Mandelbrot's assumption for the effort function is,
in this notation, simply that

$$ e^*(u) = a_0 + a_1 u. \qquad (3.14) $$

We will assume that e*(u) can be expanded in a Taylor
series about the point u = 0, that is,

$$ e^*(u) = \sum_{k=0}^{\infty} a_k u^k. \qquad (3.15) $$

For small values of u, e*(u) will be well approximated
by the first two terms of (3.15) if $a_1$ is not zero,
leading to the Zipf-Mandelbrot Law; if a better
approximation is desired, more terms must be taken from the
series expansion of e*(u). Suppose for instance that
an approximation accurate through terms quadratic
in u is used:

$$ e^*(u) = a_0 + a_1 u + a_2 u^2. $$

Using (3.11) through (3.13), we obtain

$$ \log p(x) = a_0 + a_1 \log(x-a) + a_2 (\log(x-a))^2; $$

apart from an additive constant, the right hand side can
be written as

$$ -\log(x-a) + a_2 \left\{ \log(x-a) + (1+a_1)/2a_2 \right\}^2, $$

so

$$ p(x) = \frac{c}{x-a}\, e^{a_2 \{\log(x-a) + (1+a_1)/2a_2\}^2}
        = \frac{c}{x-a}\, e^{-\frac{1}{2}\left(\frac{\log(x-a) - m}{s}\right)^2} \qquad (3.16) $$



with

$$ c = e^{(4 a_0 a_2 - (1+a_1)^2)/4a_2}, \qquad (3.17) $$

which is the lognormal distribution with lognormal mean

$$ m = -(1 + a_1)/2a_2 \qquad (3.18) $$

and lognormal standard deviation

$$ s = \sqrt{-1/2a_2}. \qquad (3.19) $$

In other words, the parameters $a_1$ and $a_2$ which appear
in the effort function are related to the
parameters of the lognormal distribution defined by
that effort function as follows:

$$ a_1 = -(1 - m/s^2), \qquad a_2 = -1/2s^2, \qquad (3.20) $$

showing in particular that $a_2$ must be negative
in order that the distribution correspond to a realizable
system. Using these values for the constants appearing
in the effort function yields

$$ e^*(u) = a_0 - (1 - m/s^2)\,u - u^2/2s^2, \qquad (3.21) $$

which shows that Mandelbrot's hypothesis is warranted
when s and m are large in such a way that the quotient
$m/s^2$ remains finite. If s is not large, then necessarily
the simple logarithmic effort hypothesis is
inadequate.

This observation explains Fairthorne's remark,
quoted above, but it has the further-reaching consequence
that the question of whether the Zipf or the lognormal
provides a 'better' fit to given data is really
meaningless from this point of view; the Zipf is a
special case of the lognormal and can therefore never
provide a better fit by the implied metric than the
latter. Moreover, the larger s is, the easier it will
be to confuse the two distributions, since the general
form will more nearly approach its specialization as
s increases.



The derivation of the lognormal distribution given
above is based on the measure, I, of information per
unit effort defined by eq. (3.5) which occurs in
Mandelbrot's work and is also used by Good. It is, however,
not the only reasonable measure of average effort. In
fact, as W. E. Houchin has observed in a personal
communication, the choice of

$$ I^* = -\sum_x p(x) \log p(x)/e(x), $$

the expected value of the information per unit effort,
in place of I (which is the expected value of the
information per expected value of the effort) leads to
the lognormal function in a more direct manner, free
of the burdensome hypothesis that the effort function
is quadratic in the logarithm of size or rank. For,
arguing as before with I* in place of I leads to the
maximization of

$$ H^* = -\sum_x p(x) \{\log p(x)/e(x)\} + a_1^* \sum_x p(x) + a_2^* \sum_x p(x)\,e(x) $$

and therefore to the equations

$$ 0 = e(x)\,\partial H^*/\partial p(x) = -(1 + \log p(x)) + a_1^* e(x) + a_2^* e(x)^2, $$

with solution

$$ \log p(x) = -1 + a_1^* e(x) + a_2^* e^2(x). $$

If the effort is a logarithmic function of x, e(x) = log(x-a),
then p(x) is the lognormal function with lognormal mean

$$ m^* = -(1 + a_1^*)/2a_2^* $$

and lognormal standard deviation

$$ s^* = \sqrt{-1/2a_2^*}; $$

these equations should be compared with eqs. (3.18)
and (3.19). If m* = 0, then this lognormal function
reduces to Zipf's original 'law' with exponent -1;
if m* and s* are both large such that $m^*/s^{*2}$ is finite
the general power function of Mandelbrot and Stevens
results.

Recalling that f(u) = log p(x) = e*(u), (3.21) shows
that the graph of f(u) vs. u, that is, of log p(x)
vs. log(x-a), will be a parabola if p(x) is lognormally
distributed, whereas it will be a straight line if
p(x) is distributed according to Zipf's Law. It is
instructive to examine some examples.
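The distinction between the straight line and the parabola can be made concrete by looking at second differences of f(u) on an evenly spaced grid in u: they vanish for a straight line and are constant and negative for a parabola opening downward. The sketch below is illustrative only; the parameter values and function names are assumptions chosen for the demonstration, not quantities taken from the report.

```python
import math

def log_zipf(u, c=1.0, s=1.1):
    """log p for the power law p = c * (x - a)**(-s), written in u = log(x - a)."""
    return math.log(c) - s * u

def log_lognormal(u, m=2.0, s=0.7):
    """log p, up to an additive constant, for the lognormal form (3.16)."""
    return -u - 0.5 * ((u - m) / s) ** 2

def second_differences(f, us):
    """Discrete curvature of f sampled at the equally spaced points us."""
    vals = [f(u) for u in us]
    return [vals[i - 1] - 2 * vals[i] + vals[i + 1]
            for i in range(1, len(vals) - 1)]

h = 0.25                               # grid spacing in u
us = [i * h for i in range(12)]
zipf_curv = second_differences(log_zipf, us)        # all (numerically) zero
logn_curv = second_differences(log_lognormal, us)   # all equal to -h**2 / s**2
```

With s = 0.7 the lognormal second differences all equal -h²/s², so the curvature seen on log-log paper grows as s shrinks and fades as s becomes large, in line with the remark that the two distributions are easily confused for large s.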

Figure 3.2 displays the graph of the frequency of the
most frequent ordered pairs of words drawn from English
language text as a function of their rank, drawn on
logarithmic graph paper. The data was derived from
the corpus which constitutes the Standard Sample of
Present-Day Edited American English (17). It approximates
a straight line without any considerable evidence
of global curvature, thereby supporting the hypothesis
that the effort function e*(u) is linear and the
consequent distribution Zipf. The arguments that are
usually applied to justify the Zipf approximation for
the distribution of frequency ranked single words would
seem to apply equally well to this case, and therefore
the arguments of Carroll (16) concerning the problems
associated with the extraction of finite samples
from theoretical distributions of lognormal type are
also probably valid here, which helps to explain the
bending of the curve in the direction of low frequencies
for large ranks.

Next consider the distribution of the number of index
entries in monographs, shown in Figure 3.3. The data
is drawn from the Fondren Index Sample, which is described
in detail in Chapter IV. Departure from linearity is
clearly exhibited; this data is also shown in Figure 2.3,
where it is plotted on lognormal probability paper with
striking results which suggest that the points in
Figure 3.3 should approximate a parabola. A portion
of the parabola that fits this data is shown in the
figure. The reader will notice certain peculiarities
of the distribution of data points that are characteristic
of this type of problem and lead to difficulties
of estimation. First of all, small values of the
independent variable (the number of index entries in this
case) correspond to few data points if the data has
been grouped for calculational convenience as these
data have been. On the other hand, large values of the
independent variable correspond to many data points
even after grouping if the group intervals are of uniform
size, and for many of these the corresponding
frequencies will coincide, leading to vertical
'segments' such as appear over the '1', '2', and '3'
monograph markers. It is the geometric mean of these
values that is important if the distribution is in
fact lognormal.

[Figure 3.2. Ordered pairs of English words. Axis label: frequency.]

[Figure 3.3. Index entry distribution from Fondren Index Sample. Axis label: number of index entries in monograph.]



More complex phenomena sometimes occur, and are perhaps
most easily initially analyzed by studying the nature
of the polynomial functions that approximate them, or
at least portions of them, in the variables u and f(u).
Consider, for example, the distribution of the number
of pages in a monograph, shown in Figure 3.4. The
data is drawn from the Fondren Sample (cf. Dolby et
al. (19)) and has been grouped. From the figure one
sees that monographs of fewer than 220 pages appear
to follow the Zipf distribution whereas longer monographs
have lengths that are well approximated by part of
a lognormal distribution since they correspond to data
points that fall nearly on part of a parabola as
shown in the figure; the small arrows indicate computed
values of points lying on the fitting parabola. We
have no satisfactory explanation for this curious
discontinuity in the effort function which describes
the distribution of lengths of monographs which
maximizes information per unit effort. Extraneous factors,
perhaps related to the technology and economics of
printing, are probably responsible but we have thus
far been unable to isolate them. The reader should,
however, compare Figures 2.5 and 2.6 which show the
page distribution for unindexed and indexed monographs
in the Fondren Sample.



Now consider the problem of determining the parameters
of a lognormal distribution which represents given
data. Take the equation of the lognormal in the form

$$ y = \frac{N}{s\,(x-a)\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\log(x-a) - m}{s}\right)^2}. \qquad (3.22) $$

If the data consists of sample measurements $\{x_i\}$ such
that the sample frequency of occurrence of $x_i$ is $y_i$,
then N is just the total number of measurements:

$$ N = \sum_i y_i. \qquad (3.23) $$

[Figure 3.4. Page distribution from Fondren Sample.]



The values of a, m, and s are determined by introducing
an auxiliary quantity related to the skewness of the
sample distribution, to whose definition we now turn.

Some more terminology is necessary. Define the k-th
sample moment by

$$ \mu_k' = \frac{1}{N} \sum_i x_i^k y_i. $$

$\mu_1'$ is the usual mean of the sample. If the sample
moments are known, the central moments

$$ \mu_k = \frac{1}{N} \sum_i (x_i - \mu_1')^k y_i $$

can be calculated. Expressions for the first four
central moments in terms of the sample moments are useful
when considering lognormal distributions. The first
central moment is evidently zero, since

$$ \mu_1 = \frac{1}{N} \sum_i (x_i - \mu_1')\,y_i
         = (1/N) \sum_i x_i y_i - (\mu_1'/N) \sum_i y_i = 0. $$

The next three central moments are given by the relations

$$ \mu_2 = \mu_2' - (\mu_1')^2 ; $$
$$ \mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 ; $$
$$ \mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 . $$

The positive square root of $\mu_2$ is usually called the
standard deviation and denoted by $\sigma$:

$$ \sigma = \sqrt{\mu_2}. $$



Introduce the ratios

$$ \beta_1 = \mu_3^2/\mu_2^3 \qquad (3.24) $$

and

$$ \beta_2 = \mu_4/\mu_2^2. \qquad (3.25) $$

$\beta_1$ is called the skewness of the sample; it provides a
simple measure of the departure from symmetry about its
mean. The skewness of the normal distribution, as
for all symmetrical distributions, is 0. $\beta_2$ is sometimes
known as the kurtosity of the sample distribution. If
$\beta_2 < 3$ it is lower and flatter than the normal
distribution. The kurtosity of a normal distribution is 3.

In the literature, and unfortunately also in tables,
other formulae are sometimes used to define quantities
known as skewness and kurtosity. It is common, for
instance, to find

$$ \gamma_1 = \mu_3/\sigma^3 = \pm\sqrt{\beta_1} \qquad (3.26) $$

called 'skewness' (note that the sign of $\gamma_1$ is the same
as the sign of $\mu_3$), and

$$ \gamma_2 = \beta_2 - 3 $$

is sometimes called 'kurtosity'. We follow Karl Pearson's
usage, as found for instance in Ref. (20).

Skewness and kurtosity are of interest because they
provide simple measures of the deviation of a sample
distribution from the normal distribution and can be
used to determine a family of distributions likely to
provide an accurate and practical representation of
data exhibiting skew variation; this procedure was
introduced by Pearson (21). We will later have occasion
to compare the lognormal representation of data occurring
in information systems with representations by means
of Pearson's distributions.



Skewness is of immediate interest here because the
three numbers $\mu_1'$, $\mu_2$, $\gamma_1$ determine the lognormal
distribution that best fits given sample data. The unknown
parameters a, m, and s of (3.22) can be expressed by
means of an auxiliary quantity $\eta$ which is the real
root of the equation

$$ \eta^3 + 3\eta - \gamma_1 = 0 \qquad (3.27) $$

where $\gamma_1 = \mu_3/\sigma^3$ is the square root of the skewness
(with the correct sign) as defined in (3.26). It is easy to
see that there is in fact just one real root to (3.27),
for otherwise there must be three, so the graph of
the left side of (3.27) would have two turning points,
which means that the derivative of the left side would
have two real roots. But the derivative is $3\eta^2 + 3$,
whose roots are pure imaginary.

The unique real root of the cubic equation (3.27) is
readily and accurately approximated by using Newton's
method. First select some reasonable approximation $\eta_1$
to the root; if a better choice is lacking, set

$$ \eta_1 = \gamma_1/3. $$

If $\eta_\nu$ is the $\nu$-th approximation to the root $\eta$, then the
next approximation is

$$ \eta_{\nu+1} = \eta_\nu + (\gamma_1 - 3\eta_\nu - \eta_\nu^3)/(3\eta_\nu^2 + 3). \qquad (3.28) $$

For example, if $\gamma_1 = 3$, then $\eta_1 = 3/3 = 1$, and

$$ \eta_2 = 1 + (3 - 3 - 1)/(3 + 3) = 0.833333\ldots, $$
$$ \eta_3 = 0.795556, $$
$$ \eta_4 = 0.817973, $$
$$ \eta_5 = 0.817732 = \eta_6; $$

therefore $\eta$ = 0.817732 correct to six places.
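The iteration (3.28) is easily mechanized. The following is a minimal sketch rather than the procedure actually used for the report: the function name `solve_eta`, the stopping tolerance and the iteration cap are assumptions introduced for illustration.

```python
def solve_eta(gamma1, tol=1e-12, max_iter=50):
    """Real root of eta**3 + 3*eta - gamma1 = 0 by Newton's method (3.28)."""
    eta = gamma1 / 3.0                       # starting value suggested in the text
    for _ in range(max_iter):
        # Newton correction: -f(eta) / f'(eta) with f(eta) = eta**3 + 3*eta - gamma1
        step = (gamma1 - 3.0 * eta - eta ** 3) / (3.0 * eta ** 2 + 3.0)
        eta += step
        if abs(step) < tol:
            break
    return eta

eta = solve_eta(3.0)
residual = eta ** 3 + 3.0 * eta - 3.0        # should be essentially zero
```

For gamma1 = 3 the routine settles on the same six-place root, 0.817732, as the hand calculation.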



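Given a sufficiently accurate value of $\eta$, the parameters
of the lognormal distribution (3.22) are (cf. Cramér,
Ref. (22)):

$$ a = \mu_1' - \sigma/\eta, \qquad (3.29) $$

$$ s = \{\log(1 + \eta^2)\}^{1/2}, \qquad (3.30) $$

$$ m = \log(\sigma/\eta) - \tfrac{1}{2} s^2. \qquad (3.31) $$

Recall that $\sigma = \sqrt{\mu_2}$ is the standard deviation.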
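As a check on (3.29) through (3.31), one can compute the parameters from a mean, a standard deviation and a value of η, and then confirm that the resulting three-parameter lognormal has those same moments. The sketch below is illustrative only: the input values are hypothetical, and the closing formulas are the standard expressions for the mean, standard deviation and skewness of a shifted lognormal, not formulas quoted from the report.

```python
import math

def lognormal_parameters(mean, sigma, eta):
    """Parameters (a, s, m) of (3.22) computed from (3.29)-(3.31)."""
    a = mean - sigma / eta                       # (3.29)
    s = math.sqrt(math.log(1.0 + eta ** 2))      # (3.30)
    m = math.log(sigma / eta) - 0.5 * s ** 2     # (3.31)
    return a, s, m

# Hypothetical sample statistics, chosen only to exercise the formulas.
a, s, m = lognormal_parameters(mean=10.0, sigma=4.0, eta=0.5)

# Round trip through the standard moment formulas of a shifted lognormal.
w = math.exp(s ** 2)
fitted_mean = a + math.exp(m + 0.5 * s ** 2)
fitted_sigma = math.exp(m + 0.5 * s ** 2) * math.sqrt(w - 1.0)
fitted_gamma1 = (w + 2.0) * math.sqrt(w - 1.0)
```

The recovered skewness equals η³ + 3η, which is exactly the condition (3.27) that defines η.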

If the parameters of a lognormal distribution are known,
it is of course possible to calculate the kurtosity
of that distribution; the result is expressible in
terms of $\eta$ as

$$ \beta_2 = \eta^8 + 6\eta^6 + 15\eta^4 + 16\eta^2 + 3, \qquad (3.32) $$

and is of no particular value except that it permits
one to sketch the graph of the skewness and kurtosity
pairs that can belong to lognormal distributions.
Figure 3.5 shows such a graph. The skewness and kurtosity
of a particular data sample determine a point in the
plane; the farther this point is from the lognormal
curve, the less likely is the hypothesis that the sample
data is drawn from a lognormal distribution.

By the usual techniques of the differential
calculus it is easily shown that the maximum value of
the lognormal distribution (3.22) is attained when

$$ x = a + e^{m - s^2}. \qquad (3.33) $$

Some examples showing how these equations are to be
applied to sample data may have some interest for the reader.

Consider first the distribution of the number of entries
in a monograph index. For data drawn from the Fondren
Index Sample, Figure 3.3 shows that this distribution
apparently is lognormal. The following table, which
summarizes the data in Table 4.3, groups the sample
measurements according to intervals of 500 index
entries and nominally associates them with the center
value in each interval. This coarse grouping scheme
decreases the number of categories to the point where
the necessary calculations can be conveniently performed
on a desktop calculator.

[Figure 3.5]



Table 3.1

GROUPED NUMBER OF
INDEX ENTRIES FOR MONOGRAPHS

    Nominal No.        Number of
    of Entries         Indexes

        250               326
        750               214
      1,250                76
      1,750                36
      2,250                17
      2,750                 9
      3,250                 6
      3,750                 8
      4,250                 2
      4,750                 6
      5,250                 1
      5,750                 1
      6,250                 2
      6,750                 1
      7,250                 1



One finds

    $N$      = 706
    $\mu_1'$ = 831.444759
    $\mu_2$  = 878,281.765706,    $\sigma$ = 937.166882
    $\mu_3$  = 2,551,331,421.74
    $\mu_4$  = 13,238,646,211,200

to twelve figures. Consequently

    $\gamma_1$ = 3.0997
    $\beta_1$  = 9.6080
    $\beta_2$  = 17.1623 .
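The first three of these quantities can be recomputed directly from Table 3.1. The following sketch is offered only as an illustration of the definitions; the helper names `raw_moment` and `central_moment` are introduced here and are not the authors'.

```python
import math

# Table 3.1: interval centers (nominal number of entries) and number of indexes.
centers = [250 + 500 * i for i in range(15)]
counts = [326, 214, 76, 36, 17, 9, 6, 8, 2, 6, 1, 1, 2, 1, 1]

N = sum(counts)

def raw_moment(k):
    """k-th sample moment about the origin for the grouped data."""
    return sum(x ** k * y for x, y in zip(centers, counts)) / N

def central_moment(k):
    """k-th sample moment about the mean for the grouped data."""
    mean = raw_moment(1)
    return sum((x - mean) ** k * y for x, y in zip(centers, counts)) / N

mean = raw_moment(1)
mu2 = central_moment(2)
mu3 = central_moment(3)
sigma = math.sqrt(mu2)
gamma1 = mu3 / sigma ** 3

# Cross-check against the relations between sample and central moments.
mu2_alt = raw_moment(2) - mean ** 2
mu3_alt = raw_moment(3) - 3 * raw_moment(2) * mean + 2 * mean ** 3
```

Run on the grouped data, the sketch gives N = 706 and values of the mean, the standard deviation and γ₁ that agree closely with those printed above.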

According to Figure 3.5, the sample value of $\beta_1$
corresponds to $\beta_2 \approx 24$ if the distribution is lognormal.

The disagreement is not serious in view of the effect
of grouping the data and the small size of the sample;
regarding the latter point, J. Carroll's remarks in
Reference (16) are instructive. For the original
ungrouped data, it turns out that $\mu_1'$ = 836, so the
effect of grouping the data is not entirely negligible.

These estimates imply $\eta$ = 0.837450 correct to six
figures; indeed, since $\gamma_1$ is nearly equal to 3, it
is a good idea to select the solution to (3.27)
previously calculated as an example as the starting
value of the approximation procedure, leading to

    $\eta_1$ = 0.817732,
    $\eta_2$ = 0.837642,
    $\eta_3$ = 0.837450 = $\eta$.
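The kurtosity that a lognormal distribution must have for this value of η can also be obtained from (3.32) instead of being read from the graph. A short sketch follows; the function name is an assumption made for illustration.

```python
def lognormal_skewness_kurtosity(eta):
    """Point (beta1, beta2) on the lognormal curve for a given eta."""
    gamma1 = eta ** 3 + 3.0 * eta                 # from (3.27)
    beta1 = gamma1 ** 2                           # skewness in Pearson's sense
    beta2 = (eta ** 8 + 6 * eta ** 6 + 15 * eta ** 4
             + 16 * eta ** 2 + 3)                 # (3.32)
    return beta1, beta2

beta1, beta2 = lognormal_skewness_kurtosity(0.837450)
normal_limit = lognormal_skewness_kurtosity(0.0)  # eta = 0 gives the normal values
```

For η = 0.837450 this gives a skewness of about 9.61 and a kurtosity of about 23.9, consistent with the value of roughly 24 taken from the figure; η = 0 returns the normal values 0 and 3.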



Substitution in (3.29)-(3.31) produces the parameters
of the lognormal:

    a = -287.6 ,
    s = 0.7284 ,
    m = 6.755 ,

so

$$ y = \frac{706}{0.7284\,(x+287.6)\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\log(x+287.6) - 6.755}{0.7284}\right)^2}. $$



Observe that 



c"’ = 858 = 874 = K^, 



According to these calculations, the (grouped) number
of index entries behaves as if approximately 288 entries
are 'missing' from indexes. To what extent this must
be attributed to the effect of grouping the data so
coarsely we have not attempted to determine except
to note that the intervals that were chosen will tend
to produce this type of qualitative effect because
most of the indexes represented in the first group
(those having fewer than 500 entries) contain more
than the 250 entries nominally ascribed to that
category, thus biasing the distribution toward low values.

The next example is a particularly useful pedagogical
illustration because the data are unusually regular
and occur in large number in a form convenient for
calculations, but it is of considerable independent
interest as well. Consider the distribution of
dictionary words according to the number of vowel
strings they contain. The number of vowel strings,
in the technical sense in which it is used here, is
a graphemic substitute for the phonemic notion of the
number of syllables contained in the spoken form of
a word; the precise definition we use is given in
Reference (23), but will not be necessary for our present
purposes since the intuitive correspondence with the
notion of syllable is sufficiently accurate. The
words under consideration are the 64,041 lexed words
of the Shorter Oxford Dictionary which contain at least
one vowel string. Figure 3.6 displays the data on
bi-logarithmic graph paper; the general parabolic
tendency is apparent. From the data given in Table 3.2
one readily calculates

    $N$      = 64,041 ,
    $\mu_1'$ = 2.6889 ,
    $\mu_2$  = 1.1096 ,    $\sigma$ = 1.0534 ,
    $\mu_3$  = 0.5027 ,
    $\mu_4$  = 3.7011 .






TABLE 3.2

DISTRIBUTION OF LEXED WORDS
FROM THE SHORTER OXFORD DICTIONARY
BY NUMBER OF VOWEL STRINGS

    Number of          Observed            Calculated
    Vowel Strings      Number of Words     Number of Words

         0                     63                 285
         1                  7,158               6,618
         2                 22,568              22,160
         3                 20,762              22,072
         4                 10,293               9,737
         5                  2,770               2,531
         6                    393                 691
         7                     30                 178
         8                      4                  24



So

    $\gamma_1$ = 0.4301 ,
    $\beta_1$  = 0.1850 ,
    $\beta_2$  = 3.0060 .

For a lognormal distribution, this value of skewness
implies a kurtosity approximately equal to 3.3,
which agrees reasonably well with the value computed
from the sample. The calculated parameters of the
lognormal fitting these data are

    a = -4.7086 ,
    s = 0.1407 ,          (3.34)
    m = 1.9916 .

The maximum value of this lognormal distribution is
y = 25,000 and occurs at x = 2.46. The rightmost column
of Table 3.2 shows the number of words as a function
of the number of vowel strings as calculated from the
distribution defined by the parameters given in (3.34)
above.
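The calculated column and the location of the maximum can be approximately reproduced from (3.22) and (3.33). The sketch below is an illustration under one stated assumption: the parameter a is taken to be negative, a = -4.7086, since x - a must be positive for every tabulated number of vowel strings and only that sign gives frequencies near the tabulated ones.

```python
import math

N = 64041
a, s, m = -4.7086, 0.1407, 1.9916    # parameters (3.34); a assumed negative

def lognormal_frequency(x):
    """Frequency y(x) from (3.22) with the parameters above."""
    z = (math.log(x - a) - m) / s
    return N * math.exp(-0.5 * z * z) / (s * (x - a) * math.sqrt(2.0 * math.pi))

mode = a + math.exp(m - s ** 2)      # location of the maximum, (3.33)
peak = lognormal_frequency(mode)     # height of the maximum

# Rounded frequencies for 0 through 8 vowel strings.
calculated = [round(lognormal_frequency(k)) for k in range(9)]
```

With these rounded parameters the maximum falls close to x = 2.5 with a height near 25,000, and the frequencies at two and three vowel strings come within about one half of one percent of the tabulated 22,160 and 22,072; small differences are to be expected from rounding of the parameters.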




Some degree of caution must be exercised when one
attempts to determine if data can be reasonably fitted
by a lognormal function. If the entire range of the
variable is not represented by the data due either to
unfortunate grouping or absence of information, graphical
representation of the data may be misleading. We will
describe one way in which this can happen. Suppose that
{(x, y)} is a data sample exhibiting the frequency function
of some variable x. Let {(x, Y)} denote the cumulative
frequency distribution defined by

    $Y(x)$ = sum of $y(x_0)$ for $x_0 \ge x$.

For instance, if y(x) denotes the number of individuals
having an annual income of x dollars (more exactly,
x+b dollars for some conveniently chosen small increment
b), then Y(x) denotes the number of individuals with
income at least x dollars. The latter cumulative
distribution is exact in the sense that it presents the actual
number of individuals belonging to the corresponding
category of the sample, whereas the frequency distribution
presents grouped data as a substitute for frequency
density functions and therefore potentially introduces
error into the sample data. For this reason it is often
desirable to analyze cumulative sample distributions
rather than the corresponding approximate frequency function.

Consider, therefore, a cumulative distribution {(x, Y)}
whose points fall close to a straight line when exhibited
on log-log graph paper. In this event it is reasonable
to conclude that $\log Y = \alpha \log x + \log c$, so

$$ Y = c x^{\alpha}: \qquad (3.35) $$

Y is a power function of x. The frequency function can
be retrieved from (3.35) from the relation

$$ y = dY/dx; $$

we find

$$ y = c \alpha x^{\alpha - 1}, $$

so y is also a power function of x.
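The passage from a straight line for Y to a straight line for y on log-log paper can be checked numerically. The sketch below is illustrative only: the constants c and α are arbitrary, and the derivative is taken in magnitude, since a count of incomes of at least x decreases as x grows.

```python
import math

c, alpha = 1.0e6, -1.5     # illustrative constants, not fitted to any data

def cumulative(x):
    """Cumulative power function Y = c * x**alpha of (3.35)."""
    return c * x ** alpha

def frequency(x, h=1e-4):
    """Magnitude of dY/dx estimated by a central difference."""
    return abs(cumulative(x + h) - cumulative(x - h)) / (2.0 * h)

def loglog_slope(f, x1, x2):
    """Slope of the line through two points of f plotted on log-log paper."""
    return (math.log(f(x2)) - math.log(f(x1))) / (math.log(x2) - math.log(x1))

slope_Y = loglog_slope(cumulative, 10.0, 1000.0)   # equals alpha
slope_y = loglog_slope(frequency, 10.0, 1000.0)    # close to alpha - 1
```

The slope for Y is α and the slope for y is α - 1, so both plots are straight lines whenever the whole range of x follows the power function.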

If the entire range of the variable x is not represented
by the sample data, this procedure of determining the
theoretical frequency function from its cumulative
distribution by differentiation can be misleading. Figure 3.7
displays the cumulative income distribution for Great
Britain in 1893-94 and for the United States in 1968
drawn on log-log graph paper. The data for the former
fall along a straight line, which implies that it and
hence also the associated frequency function are power
functions. This is the famous 'law' of Vilfredo Pareto
(Ref. (6), vol. 2, p. 304 et seq.). The leftmost part
of the corresponding distribution for the United States
also exhibits a generally linear trend but the rightmost
portion cannot be so construed at all. Several
interpretations of this anomaly are possible, including some that
are based on variations between the underlying economic
and social structures of the two nations during the two
time periods surveyed, but it is possible to account
for the apparent contradiction by examining the extent
to which the sample data represents the entire range of
variation.

[Figure 3.7. Distribution of income (logarithmic scales).]

The United States data refers to income of any size
reported on tax returns, of which more than 73 million were
tallied. The data used by Pareto refers only to incomes
greater than 150 pounds sterling per year, of which 400,648
were reported. The number of inhabitants per income
reported was nearly 3 for the United States in 1963;
using this figure, we see that if Pareto's data includes
essentially all incomes, the population of Great Britain
in 1893-94 should have been about 1.2 million. It was
in fact perhaps greater than 8 million, which suggests
that there were possibly more than two million people
in Great Britain in those years having a positive income
less than 150 pounds per year. The reader should not
interpret this estimate as anything but a very crude
indication of the number of incomes that were probably
overlooked in the data sample. Now apply this estimate
to extend the graph for Great Britain in Figure 3.7;
the extension must turn downward when the total number
of incomes exceeds about 2.4 million, so the extended
income curve will have the same general appearance as
that for the United States.

As has already been remarked, the cumulative distribution
of income function for the United States certainly is
not a power function and therefore the corresponding
frequency function cannot be either. By plotting the
grouped frequency data published by the Internal Revenue
Service on lognormal probability graph paper, it can be
seen that the frequency function of income distribution
can be approximated throughout its entire range by a
lognormal function with the parameter a of eq. (3.22)
approximately equal to $4,000. It is likely that the
income distribution for Great Britain used by Pareto
can also be approximated by a lognormal function, but
it is necessary to have an accurate estimate of the total
number of incomes less than 150 pounds per year before
one can calculate the cumulative fractions necessary
for the employment of lognormal probability graph paper
or the methods for estimating the parameter values from
sample data given earlier in this Chapter.



References

1.  Estoup, J. B., Gammes sténographiques, 4th Edition,
    1916.

2.  Bradford, S. C., "Sources of Information on
    Specific Subjects", Engineering, (1934), January 26.

3.  Mandelbrot, B., "An Information Theory of the
    Statistical Structure of Language", Proceedings
    of the Symposium on Applications of Communication
    Theory, Butterworth, 1953.

4.  Mandelbrot, B., "On the Language of Taxonomy: an
    Outline of a 'Thermostatistical' Theory of Systems
    of Categories with Willis (natural) Structure",
    Information Theory: Papers Read at a Symposium on
    Information Theory, London 1955, Butterworth, 1956,
    135-45.

5.  Zipf, G. K., Human Behavior and the Principle of
    Least Effort, Addison Wesley, 1949.

6.  Pareto, V., Cours d'économie politique, Lausanne,
    1897.

7.  Lotka, A. J., "The Frequency Distribution of
    Scientific Productivity", Journal of the Washington
    Academy of Sciences, 16 (1926), 317.

8.  Price, D. J. de Solla, Little Science, Big Science,
    Columbia University Press, 1963.

9.  Schrödinger, E., Statistical Thermodynamics,
    Cambridge University Press, Cambridge, 1964.

10. Yule, G. U., The Statistical Study of Literary
    Vocabulary, Cambridge: The University Press, 1944.

11. Belonogov, G. G., "On some Statistical Regularities
    in Written Russian", Vopr. Jazykoznanija, 7 (1962),
    100 (in Russian).

12. Good, I. J., "Statistics of Language: Introduction",
    Encyclopaedia of Linguistics, Information, and
    Control, Pergamon Press, London, 1969, 567-81.

13. Houston, N. and E. Wall, "The Distribution of Term
    Usage in Manipulative Indexes", American
    Documentation, 15 (1964), 105-14.

14. Fairthorne, R. A., "Empirical Hyperbolic
    Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric
    Description and Prediction", Journal of
    Documentation, 25 (1969), 319-43.

15. Wall, E., "Further Implications of the Distribution
    of Index Term Usage", Parameters of Information
    Science: Proceedings of the American Documentation
    Institute Annual Meeting, 1964, Volume 1, American
    Documentation Institute, 1964, 457-66.

16. Carroll, J. B., "On Sampling from a Lognormal
    Model of Word Frequency Distribution",
    Computational Analysis of Present-Day American English
    (Henry Kucera and W. Nelson Francis), Brown
    University Press, Providence, Rhode Island, 1967, 406-24.

17. Kucera, H., and W. N. Francis, Computational
    Analysis of Present-Day American English, Brown
    University Press, Providence, Rhode Island, 1967.

18. Shannon, C. E., "Prediction and Entropy of Printed
    English", Bell System Technical Journal, 30 (1951),
    50.

19. Dolby, J. L., V. J. Forsyth, and H. L. Resnikoff,
    Computerized Library Catalogs: Their Growth, Cost,
    and Utility, M.I.T. Press, Cambridge, 1969.

20. Pearson, E. S., and Hartley, H. O., Biometrika
    Tables for Statisticians, Volume I, Cambridge,
    England: The University Press, 1954.

21. Pearson, Karl, "Mathematical Contributions to the
    Theory of Evolution", Philosophical Transactions,
    A, 186 (1895), 343-414; 197 (1901), 443-59; 216 (1916),
    429-57.

22. Cramér, H., Mathematical Methods of Statistics,
    Princeton University Press, Princeton, 1946.

23. Dolby, J. L., and H. L. Resnikoff, "On the
    Structure of Written English Words", Language, 40 (1964),
    167-96.







CHAPTER IV 



THE STRUCTURE OF 
BACK OF THE BOOK INDEXES 







Book indexes are among the most common and most 
ancient access mechanisms , although they have not 
always been loved. Glanville, in Vanity of Dogmatizing » 
said: 

Me thinks • tis a pitiful piece of knowledge 
that can be learnt from an index , and a poor 
ambition to be rich in the inventory of another's 
treasure , 

and more recently T. E. Lawrence wrote: 

. . .half-way through the labor of an index 
to this book I recalled the practice of my 
ten years' study of history; and realized 
I had never used the index of a book fit 
to read,. 

However, as an unnamed contributor to a recent edition 
of the Encyclopedia Britannica put it, 

(it has) become almost a sine qua non that any 
good book must have its own index. 

Indeed, as we shall see below, more than one-third of 
all non-serial items in the shelf list of a medium 
size university library ^ contain an index, and it 
seems as if the back of the book index is not only 
here to stay but is in the process of spawning a genus 
of related tools for indicating ** th3 position of 
information on any given subject". 

The object of this chapter is to study indexes to
books in order to determine what structure, if any, they
possess. It is not surprising that indexes* exhibit
great variability in size, content, and utility, which
makes it difficult to assess their nature in general from
an examination of one or several exemplars. We have
elected to study indexes in three ways.




*Throughout this chapter 'index' will refer only to
back-of-the-book indexes.




The first and most reliable way is based on the selection
of a random sample of book indexes. Such a sample has
been assembled by extraction of the indexes from all
monographs represented in a random sample of the shelf
list of a medium size university library; it consists
of approximately six hundred thousand index terms
spread throughout some 700 books, and will be described
in what follows.

The second means of studying indexes is concerned with 
the structure exhibited by each index separately. 
Information of this sort cannot be obtained from statis- 
tical agglomerations; rather it demands that indexes 
be considered in detail and the resulting structures, 
if any are found, compared for a sample of indexes. 

A book index directs the user to the location of 
specified information in the book to which it refers. 
Should the book in question not contain any indexed 
information about the subject of interest, the inquirer 
is left to continue his search in the indexes of other 
unspecified books. There are, of course, several indirect
methods for deciding how the next book in the search 
process should be selected, utilizing information con- 
tained in the bibliographies or the linear shelf list 
order determined by a subject classification scheme 
such as that of the Library of Congress, but none of 
these have the virtue of immediacy nor of completeness. 

Our third means of studying indexes is based on a
cumulative index to 80 books in the field of statistics.
It appears to offer attractive efficiencies in the
information search process while it provides a view of
the overall structure of the field itself.

The Fondren Index Sample is a random sample of 668 
monograph shelf list cards corresponding to indexed 
books. Multiple volumes catalogued on one shelf list 
card increase the sample somewhat so that a total of 
706 indexes are represented. 

The Fondren Index Sample is a subsample of the Fondren
Sample, which is a random sample of cards drawn from
the shelf list of the Fondren Library at Rice University.
The Fondren Sample is described in some detail in
Reference [1]. Analyses of the sample may be expected
to accurately reflect the structure of library collec- 
tions to the extent that they are similar to the 
Fondren collection; in particular , the archival 
collections of medium size university libraries are 
probably generally similar although certain special fields
may be more or less well represented. For instance,
the Fondren collection is particularly weak in law, 
medicine, and Russian language and literature, and strong 
in chemistry. These differences are unlikely to play 
a significant role in determining the reliability of 
the sample for studying index structure since indexes 
are relatively insensitive to the nature of the subject
material to which they refer; the gross category
differences, as between science and fine arts, are,
as will be shown below, substantial, but the Fondren
collection encompasses adequate representation in
each of such broad categories.

There are special problems associated with the analysis
of complex data drawn from any sampling process. The
index sample is no exception. Some of the sample indexes
have a format so unusual as to make them incomparable
with the average index; a small number were written
in non-Roman alphabets, so we were unable to identify
correctly the structural features of interest. Because
the fraction of anomalous indexes was small, it was
decided to delete them from the index sample for this
initial study.

This decision was bolstered by another complication:
not all of the books represented by the original random
sample could be located for the present study, which
took place about two years after the original selection
of shelf list cards. The number of unlocatable items
was 33, approximately 1.7% of the Fondren Sample. This
is the effective rate of loss for the two year period,
in the sense that the usual mechanisms for tracking
items not present on the shelf in their proper location
were applied without success for these items; note
that just prior to the selection of the sample the
shelf list had been checked against the shelf and weeded.
This suggests that slightly less than 1% of the monograph
archive is lost each year.

If all 33 unlocatable items had had indexes, they would
have constituted nearly 4% of the index sample; items
excluded for special reasons such as language or format
incompatibility totalled 22. Therefore, not more than
7.5% and more likely not more than 4.5% of the indexed
volumes in the Fondren Sample have been excluded from
the index sample. With this preliminary in mind we can
now turn to the consideration of the index sample.

First observe that not all monographs are candidates
for indexing; we have found no Library of Congress
class "A" items in the sample which contain an index,
and therefore class "A" is excluded from all further
considerations. Similarly, neither maps nor musical
scores are indexible in the "back of the book" sense,
so they too are excluded. Excluding these items and
all serial publications, one finds that there are 1,830
relevant items in the Fondren sample. Of these, 668
have indexes; thus we find that 37% of the monographs
in the Fondren sample contain indexes.

As previously noted, the 668 LC cards lead to a total
of 706 volumes with indexes. The distribution of these
706 volumes by LC class is shown in Table 4.1 together
with the fraction that is indexed for each class.
This fraction runs from a low of 0.18 for N (Fine Arts)
and P (Language) to a high of 0.61 for Q (Science)
and 0.67 for V (Naval Science).

Table 4.1 also provides the mean number of index
entries per book indexed. The grand mean for the collection
is 836 index entries per book, with the class means
varying from a high of 1,391 entries per book for class F
(U.S. Local History) to a low of 614 for class J
(Political Science).

The product of these two measures provides an average
measure of the amount of access per book in the collection
and in each of its subsets. This distribution is
shown separately in Table 4.2. This list breaks
rather naturally into three subsets of nearly the same
size. The first seven categories (classes F, G, V,
K, D, E, and Q) would seem to share the property that
they are all primarily concerned with careful description
of the world as it is and as it has been. The
middle group (classes H, C, R, T, Z, L, and J) is
primarily devoted to man's effort to cope with the
environment described so carefully in the first group.
The lowest group appears a bit anomalous in that it
contains the core of the arts: music, philosophy,
religion, language, literature, and the fine arts as
well as the more mundane but ever present categories
of war and agriculture. Although we should not like
to make too much of this particular arrangement of
the LC classes, Table 4.2 does provide an interesting
example of the insight one gains into the use of the system
of literary stores by rather elementary counting procedures.
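The product described above can be retraced with a short Python sketch. The two dictionaries simply transcribe Tables 4.1 and 4.2; the function names and the seven/seven/six split into slices are illustrative choices, not something the report specifies. Because the fractions in Table 4.1 are rounded, the recomputed products are only expected to agree with Table 4.2 to within about one entry per book.

```python
# Table 4.1: LC class -> (mean entries per index, fraction indexed, rounded).
TABLE_4_1 = {
    "B": (667, 0.31), "C": (690, 0.53), "D": (1102, 0.51), "E": (1062, 0.49),
    "F": (1391, 0.46), "G": (1264, 0.50), "H": (697, 0.54), "J": (614, 0.46),
    "K": (1375, 0.43), "L": (620, 0.49), "M": (915, 0.25), "N": (615, 0.18),
    "P": (714, 0.18), "Q": (850, 0.61), "R": (716, 0.50), "S": (638, 0.20),
    "T": (707, 0.47), "U": (840, 0.22), "V": (934, 0.67), "Z": (1328, 0.24),
}

# Table 4.2 as published: LC class -> mean index entries per book.
TABLE_4_2 = {
    "F": 640, "G": 632, "V": 626, "K": 591, "D": 562, "E": 520, "Q": 519,
    "H": 376, "C": 366, "R": 358, "T": 332, "Z": 319, "L": 304, "J": 282,
    "M": 229, "B": 207, "U": 185, "P": 129, "S": 128, "N": 111,
}


def access_per_book(table):
    """Mean entries per index times fraction indexed, for each class."""
    return {cls: mean * fraction for cls, (mean, fraction) in table.items()}


def ranked_classes(access):
    """LC classes ordered from most to least index access per book."""
    return sorted(access, key=access.get, reverse=True)


access = access_per_book(TABLE_4_1)
ranking = ranked_classes(access)
# Illustrative split into the three groups discussed in the text.
groups = [ranking[:7], ranking[7:14], ranking[14:]]

for cls in ranking:
    print(f"{cls}  computed {access[cls]:7.1f}   published {TABLE_4_2[cls]}")
```

Sorting by the recomputed product reproduces the grouping of classes given in the text, which is a useful consistency check on the transcribed table values.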

The index sample consists of a total of 590,329 index
entries spread across the 706 indexes. Table 4.3 lists
the number of indexes as a function of the number of
entries they contain, grouped by hundreds of index
entries. Figure 4.1 exhibits the lognormality by showing
the data of Table 4.3 plotted on lognormal paper. The
standard deviation on the log scale is 0.442 which is at
the upper end of the range for log-length distributions
given in Chapter II.
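As a rough illustration of how the cumulative columns of Table 4.3 follow from the raw counts, the sketch below accumulates the bin counts and locates the bin containing the median index. The bin counts are transcribed from Table 4.3; the helper names are hypothetical, and the sketch does not attempt to reproduce the lognormal fit or the quoted standard deviation.

```python
# (lower bound of a 100-entry bin, number of indexes), from Table 4.3.
BIN_COUNTS = [
    (0, 16), (100, 77), (200, 83), (300, 80), (400, 70), (500, 62),
    (600, 46), (700, 37), (800, 39), (900, 30), (1000, 17), (1100, 24),
    (1200, 13), (1300, 14), (1400, 8), (1500, 2), (1600, 13), (1700, 7),
    (1800, 7), (1900, 7), (2000, 7), (2100, 3), (2200, 5), (2300, 1),
    (2400, 1), (2500, 2), (2600, 3), (2700, 3), (2800, 1), (3000, 3),
    (3100, 2), (3300, 1), (3500, 1), (3700, 2), (3800, 3), (3900, 2),
    (4000, 1), (4200, 1), (4700, 3), (4900, 3), (5100, 1), (5900, 1),
    (6200, 2), (6700, 1), (7000, 1),
]


def cumulative_rows(bin_counts):
    """Return (lower, upper, cumulative count, cumulative fraction) per bin."""
    total = sum(count for _, count in bin_counts)
    rows, running = [], 0
    for lower, count in bin_counts:
        running += count
        rows.append((lower, lower + 99, running, running / total))
    return rows


def median_bin(rows):
    """First bin at which the cumulative fraction reaches one half."""
    for lower, upper, _, fraction in rows:
        if fraction >= 0.5:
            return lower, upper
    raise ValueError("empty table")


rows = cumulative_rows(BIN_COUNTS)
print("total indexes:", rows[-1][2])
print("median index falls in bin:", median_bin(rows))
```

The median bin lies well below the grand mean of 836 entries, which is the kind of right skew one expects from a lognormal-like distribution.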




Table 4.1

FONDREN SAMPLE: FRACTION OF SAMPLE ITEMS
CONTAINING AN INDEX, BY LC LETTER CLASS

        Mean Number   Fraction    Fraction Class
        of Entries    Indexed     is of
Class   per Index     (rounded)   Fondren Sample   Short Class Name

B          667          .31          .100          Philosophy-Religion
C          690          .53          .009          History-Auxiliary Sciences
D        1,102          .51          .095          History & Topography (except America)
E        1,062          .49          .040          America (General) & U.S. (General)
F        1,391          .46          .027          United States (Local) & America (ex. U.S.)
G        1,264          .50          .011          Geography-Anthropology
H          697          .54          .104          Social Sciences
J          614          .46          .023          Political Science
K        1,375          .43          .004          Law
L          620          .49          .038          Education
M          915          .25          .015          Music
N          615          .18          .033          Fine Arts
P          714          .18          .300          Language & Literature
Q          850          .61          .093          Science
R          716          .50          .010          Medicine
S          638          .20          .006          Agriculture-Plant & Animal Husbandry
T          707          .47          .032          Technology
U          840          .22          .010          Military Science
V          934          .67          .005          Naval Science
Z        1,328          .24          .023          Bibliography & Library Science

Total relevant items in Fondren Sample = 1823
Number of these items indexed = 668
Fraction indexed = 668/1830 = 0.37







Table 4.2

INDEX ACCESS BY LC CLASS

         Mean No.
LC       Index Entries
Class    per Book        Short Class Name

F          640           U.S. Local History
G          632           Geography
V          626           Naval Science
K          591           Law
D          562           World History
E          520           U.S. History
Q          519           Science
H          376           Social Science
C          366           Auxiliary Sciences (History)
R          358           Medicine
T          332           Technology
Z          319           Library Science
L          304           Education
J          282           Political Science
M          229           Music
B          207           Philosophy-Religion
U          185           Military Science
P          129           Language & Literature
S          128           Agriculture
N          111           Fine Arts







Table 4.3

FREQUENCY OF INDEX ENTRIES FOR ITEMS
IN THE FONDREN INDEX SAMPLE

Number of        Number      Cumulative    Cumulative
Index            of          Number        Fraction
Entries          Indexes     of Indexes    of Indexes

   0 -   99        16           16           .023
 100 -  199        77           93           .132
 200 -  299        83          176           .249
 300 -  399        80          256           .362
 400 -  499        70          326           .462
 500 -  599        62          388           .549
 600 -  699        46          434           .615
 700 -  799        37          471           .667
 800 -  899        39          510           .722
 900 -  999        30          540           .765
1000 - 1099        17          557           .789
1100 - 1199        24          581           .823
1200 - 1299        13          594           .841
1300 - 1399        14          608           .861
1400 - 1499         8          616           .872
1500 - 1599         2          618           .875
1600 - 1699        13          631           .893
1700 - 1799         7          638           .903
1800 - 1899         7          645           .913
1900 - 1999         7          652           .923
2000 - 2099         7          659           .933
2100 - 2199         3          662           .937
2200 - 2299         5          667           .944
2300 - 2399         1          668           .946
2400 - 2499         1          669           .947
2500 - 2599         2          671           .950
2600 - 2699         3          674           .955
2700 - 2799         3          677           .959
2800 - 2899         1          678           .960
3000 - 3099         3          681           .964
3100 - 3199         2          683           .967
3300 - 3399         1          684           .969
3500 - 3599         1          685           .970
3700 - 3799         2          687           .973
3800 - 3899         3          690           .977
3900 - 3999         2          692           .980
4000 - 4099         1          693           .981
4200 - 4299         1          694           .983
4700 - 4799         3          697           .987
4900 - 4999         3          700           .991
5100 - 5199         1          701           .993
5900 - 5999         1          702           .994
6200 - 6299         2          704           .997
6700 - 6799         1          705           .998
7000 - 7099         1          706          1.000



[Figure 4.1: the data of Table 4.3 plotted on lognormal paper (percentage scale against number of index entries, logarithmic scale).]



A distinction should be made between the number of index
entries in an index and the number of locations to which
these entries refer. The former quantity is the number
of distinct word sequences appearing in an index,
and is an absolute measure of index size which is
independent of the details of format and page composition;
the latter is usually the number of page locations referred
to in an index, which clearly depends on the size
of the page. In the Fondren sample of indexed books
there are, on the average, 1.8 page locations per index
entry. Thus, the 836 (average) distinct entries refer,
on the average, to 1,505 text locations. As there are
on the average 341.5 pages per indexed book, there are
4.4 indexed text locations per page. Roughly speaking,
this means that there is one index page location for each
five sentences of text.

The aggregate size of the index as printed can be determined
by estimating the average number of characters per
entry and multiplying by the average number of entries.
A preliminary estimate of the average number of characters
was obtained by counting the entries in the cumulative
index to statistical books (discussed at greater length
in Chapter VI), as the format of that material is particularly
convenient for counting purposes. This estimate
shows that the entries are about 25.47 characters in
length exclusive of page location information. If, as
in Chapter I, this is augmented by 4 characters per entry
to include the typical page location reference information,
then the average index of 836 entries consists of 24,637
characters and therefore the ratio of indexed book size
to index size is about 33.27 to 1.
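The arithmetic of the last two paragraphs can be retraced as follows. All constants are the averages quoted in the text; the variable names are illustrative, and the last two quantities are merely what the quoted 33.27 to 1 ratio implies, not figures stated in this chapter.

```python
# Averages quoted in the text for the Fondren sample of indexed books.
ENTRIES_PER_INDEX = 836
LOCATIONS_PER_ENTRY = 1.8
PAGES_PER_BOOK = 341.5
CHARS_PER_ENTRY = 25.47          # exclusive of page location information
CHARS_PER_LOCATION_REF = 4       # allowance per entry for page references
BOOK_TO_INDEX_RATIO = 33.27

# Page locations referred to by an average index, and their density.
locations_per_index = ENTRIES_PER_INDEX * LOCATIONS_PER_ENTRY
locations_per_page = locations_per_index / PAGES_PER_BOOK

# Printed size of an average index, in characters.
index_chars = ENTRIES_PER_INDEX * (CHARS_PER_ENTRY + CHARS_PER_LOCATION_REF)

# Implied (not stated here) size of an average indexed book.
implied_book_chars = BOOK_TO_INDEX_RATIO * index_chars
implied_chars_per_page = implied_book_chars / PAGES_PER_BOOK

print(f"text locations per index : {locations_per_index:.0f}")
print(f"text locations per page  : {locations_per_page:.1f}")
print(f"characters per index     : {index_chars:.0f}")
print(f"implied chars per page   : {implied_chars_per_page:.0f}")
```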

These global statistics provide a direct measure of the
proportion of the monograph collection that is devoted
to what might be called "self access". The agreement
of the access ratio (of about 30 to 1) with other access
ratios developed in Chapter II helps to solidify the
foundations of the level structured access model. Given
the difficulty of assessing the quality of indexing
(see (2) and the references therein) these statistics
also provide the foundation of a basis for comparing
various indexing procedures, particularly for comparing
algorithmically derived indexes to manual indexes. The
fundamental regularities of the length measures discussed
here suggest that an algorithmically prepared index
must at least be of the correct overall size to be of
any use at all.







The fine structure of the individual indexes can presumably
shed more light on the situation. For these purposes,
we have selected a random sub-sample of 28 indexes from
the main Fondren Index Sample. For each of these indexes
we have determined the distribution of the number of
entries with one, two, three... page locations per entry.
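A minimal sketch of how such a distribution might be tabulated is given below. The representation of an index as a mapping from entry to a list of page numbers, the toy data, and the function name are all assumptions made for illustration; repeated page numbers within one entry are collapsed to a single page location.

```python
from collections import Counter


def location_distribution(index):
    """Map k -> number of entries that refer to exactly k distinct pages."""
    return Counter(len(set(pages)) for pages in index.values())


# Hypothetical toy index: entry -> page numbers on which it is cited.
toy_index = {
    "California": [3, 17, 17, 42, 88],   # page 17 counted once
    "census": [17],
    "lognormal": [42, 43],
    "Zipf": [9],
}

dist = location_distribution(toy_index)
for k in sorted(dist):
    print(f"{dist[k]} entries with {k} page location(s)")
```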

This distribution is comparable to the "frequency of
frequencies" problem discussed extensively by Zipf, Bradford,
Mandelbrot, et al. (see Chapter III). Were the index an
extractive index (i.e., one that is derived by extracting
sequences of words from the text and inserting these
sequences without change in the index) and were the page
locations explicitly tied to the position on the page so
that multiple occurrences of the entry on a single page
would occur multiply in the index, then it might be anticipated
that the text location distribution of index entries
would be a Zipf-Mandelbrot distribution which would arise
from the phrases which are the index entries in the same
way as the usual Zipf distribution arises from text word
occurrences.

However, indexing practice normally requires a set of
sophisticated transformations from the running text to the
index and also reduces multiple entries on a page to a
single page location. Further, not all "phrases" are
indexed and it would appear that those which are left out
are among both the most frequently occurring and least
frequently occurring. Nevertheless, it seems reasonable
to approach the problem at the first order of approximation
by assuming a model of the Zipf-Bradford-Mandelbrot type;
i.e., by examining the form of the distribution on log-log
graph paper. This has been done for all 28 of the sample
indexes, two of which are presented here (Figure 4.2).
(The remaining graphs appear in Appendix II.) The plots
are given in the converse form to that used by Zipf
in order to provide stability (see Kendall (3)). Thus the
largest point on the graph represents the number of index
entries with single page references rather than the
number of page references for the most frequently
referenced item.

The two graphs shown are typical for the sample as a whole.
In almost every case a straight line provides a reasonable
approximation, with slopes ranging from roughly -1.1
to -5.5. Thus the Zipf-Mandelbrot approximation holds
well for index location frequency distributions. The
importance of the slope as a parameter of index measurement
can be seen by recalling the Mandelbrot formulation
which maximized the expected information per unit effort;
the reader may find it useful to compare e.g. (3.5) ff:



[Figure 4.2: Number of Index Entries vs. Number of Page References. Two log-log panels; vertical axis: number of index entries; horizontal axis: number of page references.]



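$$ I = -\frac{\sum_x p(x)\,\log p(x)}{\sum_x p(x)\,\log x} \tag{4.1} $$

The function that maximizes this ratio is the Zipf-Mandelbrot
distribution:

$$ p(x) = c\,x^{-s} \tag{4.2} $$

Substitution of (4.2) into (4.1), together with the
normalization $c^{-1} = \sum_x x^{-s}$, yields

$$ I = -\frac{\sum_x (\log c - s \log x)\,c\,x^{-s}}{\sum_x c\,x^{-s}\,\log x}
     = s + \frac{\sum_x x^{-s} \cdot \log \sum_x x^{-s}}{\sum_x x^{-s}\,\log x} \tag{4.3} $$

where all logarithms are to the base e and the summations
extend from 1 to the maximum number of page references
per index entry.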

For s greater than one, the summations all converge to
functions of the Riemann zeta function as the maximum
number of page references per index entry increases.
Hence, with the sums running over all positive integers,
and noting that $\zeta'(s) = -\sum_x x^{-s} \log x$,

$$ I = s + \frac{\zeta(s)\,\log \zeta(s)}{-\zeta'(s)} \tag{4.4} $$

As s increases, the ratio on the right, in turn, converges
to $(\log 2)^{-1} = 1.443$, so that a first order approximation
to Mandelbrot information for the Zipf-Mandelbrot form
is given by

$$ I \approx s + 1.443 $$

For s greater than or equal to 3, the error is less than
10%. In other words, to a first order approximation,
Mandelbrot's measure of information per effort
increases linearly with s, the negative value of the
slope of the approximating straight line on log-log paper.
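The quality of the first order approximation can be checked numerically. The sketch below evaluates (4.4) with truncated sums over the positive integers and compares the result with s + 1/log 2; the truncation length and the function names are assumptions of this sketch, and the truncated sums are adequate only for s comfortably greater than one.

```python
import math


def zeta_sums(s, terms=100_000):
    """Truncated sums for zeta(s) and for -zeta'(s) = sum(log(n) / n**s)."""
    z = 0.0
    neg_dz = 0.0
    for n in range(1, terms + 1):
        weight = n ** -s
        z += weight
        neg_dz += weight * math.log(n)
    return z, neg_dz


def information_per_effort(s, terms=100_000):
    """Equation (4.4): I = s + zeta(s) * log(zeta(s)) / (-zeta'(s))."""
    z, neg_dz = zeta_sums(s, terms)
    return s + z * math.log(z) / neg_dz


def first_order(s):
    """First order approximation I = s + 1/log(2), about s + 1.443."""
    return s + 1.0 / math.log(2.0)


for s in (3.0, 4.0, 5.0, 10.0):
    exact = information_per_effort(s)
    approx = first_order(s)
    print(f"s={s:4.1f}  I={exact:.4f}  approx={approx:.4f}  "
          f"relative error={(approx - exact) / exact:.3%}")
```

At s = 3 the relative error comes out at roughly 8%, consistent with the bound quoted in the text, and it shrinks as s grows.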

For data that perfectly fits the Zipf-Mandelbrot model, 
the parameter s can be determined from the relation: 






$$ s = \frac{\log(\text{number of index entries with a single page location})}{\log(\text{number of page locations of the most popular index entry})} $$



Clearly, the greater the number of single page location
index entries and the fewer the number of multiple
page location index entries, the greater the estimate
of s and hence the greater the amount of information
per effort under Mandelbrot's definition. In the extreme
case, where each index entry refers to one, and only
one, page location, Mandelbrot information is infinite.
Although we have found no such indexes in the Fondren
sample, it is well to note that dictionaries take this
form: each main entry occurs once and the referent
information is conveniently packaged with the main
entry itself rather than through a page location to some
other source.
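The relation for s can be sketched in a few lines of Python. The first function applies the two-point relation directly; the second fits a least-squares line on log-log axes, which is one plausible way of drawing the "approximating straight line" and is an assumption of this sketch rather than the procedure documented in the report. The synthetic distribution is constructed to follow the model exactly.

```python
import math


def slope_from_extremes(single_page_entries, max_page_locations):
    """s = log(entries with one page location) / log(largest location count)."""
    if max_page_locations <= 1:
        return math.inf            # every entry cites exactly one page
    return math.log(single_page_entries) / math.log(max_page_locations)


def slope_from_fit(distribution):
    """Negated least-squares slope of log(count) against log(k)."""
    points = [(math.log(k), math.log(n))
              for k, n in distribution.items() if n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Synthetic data obeying count(k) = 512 * k**-3 exactly (hypothetical).
synthetic = {1: 512, 2: 64, 4: 8, 8: 1}

print(slope_from_extremes(512, 8))   # two-point estimate
print(slope_from_fit(synthetic))     # regression estimate
print(slope_from_extremes(100, 1))   # dictionary-like extreme case
```

For real indexes the two estimates will generally differ, since the two-point relation holds only for data that fit the model perfectly.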

The values of s for each index in the subsample are
listed in Table 4.4 in decreasing order of s. Earlier
in this chapter we organized the various monographs
by LC class and then by total number of entries per
monograph. Under this measure the LC classes fell into
three disjoint sets corresponding roughly to the descriptive
materials, the technique materials, and the arts.
The average slopes for these three groups are,
respectively, 2.83, 2.37, and 2.79. The differences
between the means are not only insignificant statistically;
even were they significant, they would not provide a
corresponding ordering. Thus the slope (and hence Mandelbrot's
measure of average information per average effort)
provides an independent measure of the index.

The 28 values of s are plotted in Figure 4.3 on lognormal
paper. The distribution of values is reasonably
approximated by a straight line, as might be expected
since, as we have now shown, s is a normalized measure
of information.

However, except for specialized indexes such as dictionaries,
multiply occurring entries do occur, thus depressing
the information ratio. For the sample plotted in Figure
4.3, the average value of s is 2.66, quite close to the
natural constant, e, which is 2.718. As these multiply
occurring entries do reduce the information ratio by
increasing the effort required, it is appropriate to
inquire as to what role they play in the index.
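The group averages and the overall average quoted above can be recomputed from the 28 exponents of Table 4.4. In the sketch below the exponents are transcribed from that table and each index is assigned to a group by the first letter of its LC number, using the three class groupings given earlier in the chapter; the group labels and function names are illustrative.

```python
# (LC number, Zipf-Mandelbrot exponent s), transcribed from Table 4.4.
TABLE_4_4 = [
    ("PT7244", 5.47), ("QD9", 4.43), ("HB199", 4.33), ("BV2532", 3.81),
    ("DA690", 3.54), ("TK153", 3.53), ("E741", 3.39), ("Q391", 3.20),
    ("QA303", 3.06), ("E178", 2.81), ("QL703", 2.71), ("RM721", 2.71),
    ("BF181", 2.69), ("DF521", 2.64), ("Z5782", 2.49), ("ND553", 2.26),
    ("HM66", 2.15), ("F864", 2.08), ("PR2831", 1.96), ("LB875", 1.94),
    ("LC191", 1.93), ("HF2046", 1.86), ("D443", 1.81), ("HD20", 1.71),
    ("PR5588", 1.69), ("PN2598", 1.67), ("DS423", 1.43), ("JA84", 1.09),
]

# The three groups of LC main classes identified from Table 4.2.
GROUPS = {
    "descriptive": set("FGVKDEQ"),
    "technique": set("HCRTZLJ"),
    "arts": set("MBUPSN"),
}


def group_of(lc_number):
    """Group label for an LC number, decided by its main class letter."""
    main_class = lc_number[0]
    for name, classes in GROUPS.items():
        if main_class in classes:
            return name
    raise ValueError(f"unclassified LC number: {lc_number}")


def group_means(rows):
    """Mean exponent per group."""
    values = {name: [] for name in GROUPS}
    for lc_number, s in rows:
        values[group_of(lc_number)].append(s)
    return {name: sum(v) / len(v) for name, v in values.items()}


means = group_means(TABLE_4_4)
overall = sum(s for _, s in TABLE_4_4) / len(TABLE_4_4)
print({name: round(m, 2) for name, m in means.items()}, round(overall, 2))
```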

Some hint as to the nature of this phenomenon can be
obtained by examining the role of the multiply occurring
entries in the context in which Zipf first studied them:






Table 4.4

ZIPF-MANDELBROT EXPONENT
FOR INDEX LOCATION DISTRIBUTION

LC Number      s

PT7244        5.47
QD9           4.43
HB199         4.33
BV2532        3.81
DA690         3.54
TK153         3.53
E741          3.39
Q391          3.20
QA303         3.06
E178          2.81
QL703         2.71
RM721         2.71
BF181         2.69
DF521         2.64
Z5782         2.49
ND553         2.26
HM66          2.15
F864          2.08
PR2831        1.96
LB875         1.94
LC191         1.93
HF2046        1.86
D443          1.81
HD20          1.71
PR5588        1.69
PN2598        1.67
DS423         1.43
JA84          1.09







[Figure 4.3: the 28 values of the Zipf-Mandelbrot slope plotted on lognormal paper; axes: percentage (2% to 98%) and Zipf-Mandelbrot slope.]





in natural language itself. Even a cursory examination
of a frequency ordered word list such as those prepared
by Thorndike and Lorge (4) and Kucera, et al. (5)
is sufficient to show that the most frequently occurring
entries are the structure words (i.e. words with parts
of speech other than noun, verb, adjective, and adverb).
Such words provide the structure in which the information
is embedded, but do not, at least in the broad sense,
contain information themselves. Except for the rare
case (e.g. in the use of certain prepositions in mathematical
treatises) such words almost never occur in
first position in an index entry.

In this context, it seems natural to suggest that the 
index entries that occur with many page locations play 
a fundamentally different role from those that refer 
only to one or a few page locations. Roughly speaking, 
we might say that the multiply occurring entries carry 
the semantic structure in much the same way that the 
multiply occurring words carry the syntactic structure. 
Suppose, for instance, that the term California appears 
in an index with, say, 15 page locations. It would 
seem reasonable to conclude, even with no other informa- 
tion about the accompanying text, that the text is 
very much concerned with California in a global manner. 
Reference to each of the various page locations would 
presumably uncover a variety of bits of information about 
California and in this particular sense, we could say
that California was one of the "subjects" discussed
in the book. If on the other hand, we were to find
another book, say on population statistics, whose index
contained a single page location for California, it
would seem appropriate to conclude that California was
one of many items discussed in the text rather than a
main subject of the text.

In short, if one is interested in "population statistics 
for the state of California" one can either go to a book 
on population statistics and look in the index for 
California, or one can go to a book on California and 
look in the index for population statistics. For obvious 
reasons both types of information packaging exist and access
to the packaged information is generally, though not
always, provided both ways: by subject to allow the
user to get to the proper book, and by index entry to
allow the user to obtain the specific fact once he
has gotten to the proper book.

The multiply occurring entries thus provide a sort of
transition from the "specific fact" aspect of the problem
to the "general subject" aspect of the problem. They
provide the basis for an algorithmic identification of
the semantic structure in the same way that the structure
words provide a basis for the algorithmic identification
of the author's syntactic style. (See Mosteller and
Wallace (6).)

For both the word frequency distribution and the index
page location distributions, there is no clear break
between the set of frequently occurring items and the
set of non-frequently occurring items. However, the
previously developed arguments on the access level
structure provide a technique for establishing break
points in the distribution: the set of most frequently
occurring entries can be defined as 1/900th of the whole
set of entries. This has been done for the subsample
of indexes from the Fondren sample. The results are
tabulated together with the LC class, the LC subject
headings, and the title in Table 4.5.
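One way to make the 1/900th rule concrete is sketched below. The index representation (entry mapped to a list of page numbers), the function name, and in particular the choice to round the quota up and to keep at least one entry are assumptions of this sketch; the text specifies only the fraction.

```python
def most_frequent_entries(index, denominator=900):
    """Entries in the top 1/denominator of an index, ranked by the number
    of distinct page locations (quota rounded up, at least one entry)."""
    quota = max(1, -(-len(index) // denominator))   # ceiling division
    ranked = sorted(index, key=lambda e: len(set(index[e])), reverse=True)
    return ranked[:quota]


# Hypothetical index of 1,800 entries: two heavily cited terms and
# 1,798 entries with a single page location each.
index = {f"term{i}": [i] for i in range(1798)}
index["California"] = list(range(15))
index["population"] = list(range(9))

print(most_frequent_entries(index))
```

An index of average size (836 entries) yields a single entry under this rule, while larger indexes yield a handful.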

Looking first at the subject heading and title information
in Table 4.5, it is clear that approximately two-thirds
of the subject headings are direct transformations (through
the subject heading authority list) of the title
information. This observation, of course, sheds considerable
light on the discussion of the utility
of permuted title indexes: anything as cheap as a permuted
title listing that can supply on the order of two-thirds
of the subject heading information automatically is
clearly useful. At the same time a device that misses
one-third of the potential information is clearly not
sufficient.

In this context the role of the multiply occurring index
entries becomes more obvious: most of the LC subject headings
that are not derivable from the title information
are derivable from the multiply occurring index entries
either directly (e.g. Andalusite, U.S.A. vs. Andalusite)
or at a higher level of synthesis (e.g. gaseous discharge
tube + ultra violet light + reaction, reactors vs.
electrical apparatus and appliances). At this stage it is
not necessary to re-open the much discussed question of
whether classification of documents can be obtained
economically through purely algorithmic processes;
other simpler problems must be solved first (e.g. the
automatic derivation of the index itself). However, it
is essential to obtain a clearer understanding of how
the various access devices already in operation interact
with one another. The preliminary results derived from
Table 4.5 make it clear that there is a direct relation
between the LC subject headings, the monograph titles,
and the multiply occurring index entries. The utility
of title derived indexes is manifest by their present use
and persistence. It remains to determine the utility of






in 



0) 
I— I 

43 

fd 

H 









G 




































•H 


0) 


V) 














0) 


u 


0) 












>1 


M 


fd 


rH 




0) 








tn 


O 


0) 


rH 










o 


P.i 


P4 


'H 












rH 






> 




fd 








o 


0) >f 




44 












4i: 


4i: -p 




U 




•H 


0) 






u 


-M 0) 


u 


fd 




M 


43 






>1 


*H 


fd 


tn 




0) 


-P 






U) 


m o 


3: 






pr: 








P4 


o o 




0) 






tP-P 








tn 


U-f 


43 




rH 


G OT 






0) 


>1 


o 


-P 




fd 


•H fd 






> 


M >1 








M 


P4 






'H 


o u 


V) 


tJ 


6 




c 




0) 


-M 


-M fd 


u 


C 




-P fd 


fd C 




1— 1 


(d 


OT a 


fd 


(C 


'H 


rH *H 


-P fd 




-M 




•H O 


0) 




-p 




OT u 




•H 


U> 


P5 "vH 




0) 


c 


U J3 


M *H 




H 


0) 


OT 




rH 


fd 


M 


0) U 






-M 


0) V) 


c 


O 


N 


0) 


^ 0) 






C 


43 -H 


0) 


C 


>1 


4^ MH 


P3 i 






M 


H s: 


H 


X 




Eh O 


D < 






1— 1 


















fd 


















u 


















■H 














w 




U> 






V) 








w 




O 






44 






V) 


M 




1— 1 




rH 


fd 






0) 


P:^ 


V) 


o 




Ch 


o 




G 


w 


H 


tP 


'H 




rH 


C 


1 


O 


V) 


:2; w 




V) 




» 


0) >1 


0) 


•H 


0) V) 


pq u 


'H 


>1 




V) 


> rH 


M 


-P 


M 0) 




TJ 


x: 




U 


0) *H 


-»H 


fd 


TJ M 


X H 


fd 


P4 




'H 


w B 


Oi 


N 




W M 


0) 






-M 


fd 


e (3 


■H 


< -P 


Q H 








•H 


- Em 


pq o 


rH 


1 U 


:z; 




>1 




rH 


44 




•H 


• 0) 


M 








o 


U 0) 


0) -P 


> 


-P rH 




u 


o 




P4 


fd rH 


G fd 


■H 


V) 


?H w 


0) 


1— 1 




1 


pL| rH 


•H N 


a 


•iH ' 


a o 


■r-i 


o 




0) 


*H 


-P -H 


1 


PC V) 






4i: 




04 


(D • > 


13 r-H 


fd 


1 >1 


W M 


3 


u 




o 


rH 44 


fd ‘H 


•H 


* fd 


O Q 


W 


>1 




U 


O O 


N > 




cn V) 


a< 




V) 






S3 f3 fd 




c 


• OT 


w w 


U 


P4 




U 


X Ei3 W 


« a 


M 


D PC 


Ccj 




r 


0) 










Pm 






G 












1 H 




• 


O 


• 


* • 


• 


« 


• 


tc a 
V) w 
M 




1— 1 


Is 


1— 1 


rH CM 


rH 


rH 




tc m 










>1 








o 










-P 






>1 



Pu W 
O 

U 

O 

w aj 

W H 
H 

*3 12 



O 

a 



o 

-M 



0) 

C 

o 

■H 

-M 



<l) 4-1 *d 

u o 

(d 

tp m C 
M 0) -H 
(d 0) 



>1 

M 

*d 

“H 



U 

(d 

S 





< 




G 


s: 44 Q 


0) 












• CO 44 








0) 


G 0) 


rH 












<3 fd o 








e 


>1 P3 *3 10 


cu 












B -H 


V) 


• TJ 






O O • 


o 








fd 




10 O P 


0) 


s: *3 






(d u -H fd 


c 








44 




<15 43 (U 


M 


g 




■H 


PI 44 M 


■H 








•H 




rH Eh fd 


•H 


• B 


0) 


(d 


10 G O 


44 








O 




U (U 


-P 


12 & 


43 


44 


•^»d 0) MH 


C 








I 




fd ^ u 


C 


•H 


H 


•H 


0) M e MH 


•• fd 








fd 




43 P3 Pm 


PC 


-tn 




M 


rH Id ‘rl 


(0 44 


(0 






Id 




a o 




G 




PC 


rH 3 ^ rH 


0) (0 


0) 






> P3 




10 - 


43 


O - 


G 




-H M 44 a 


43 G 


o 






fd fd 


CO 


u u 


44 


44 Td 


0) 


44 


> 0) 0) 


o o 


U 


fd 








fd 0) (u 


O 


in 


rH 


fd 


44 44 P3 (1) 


U O 


(0 


c 


fd 


fd j3 5 


0 


M MH 13 


O 


M 0) 


rH 


0) 


O MH Id P3 




0) 


(0 


> 43 to fd 


u 


fd MH M 


cr> 


fd u 


fd 


M 


fd fd 43 C 


43 P3 


u 


M 


■H 


m -H 


0 


Q) Q) 0 


\ 


X Pm 




e 


cn ^ Eh 


a -H 


Pm 


X 


cn 


= > m 




PC £H 


rH 
























• * 


• 


• 


• 


• 


• 


• 


» 


• • • 


• 


ft • • 




rH CM 




rH 


rH 


rH 


CM 


f-H CM 


3 

4 

5 


VO 


f-H CM M 


CO 
























(0 
























fd 




CM 




















rH 


rH 


ro 




O 


rH 




CO 










a 


00 


in 


cn 


o\ 


CM 




CM 








00 




H 


CM 




VO 


in 














a 


Pm 


> 




C 


Pm 




W 








rH 


PI 


PC 


PC 


a 


Q 


O 




Q 








PC 



ERIC 



113 

104 



Jackson 2. U. S .-Hist. -Historiography 



Table 4.5 (Continued) 

LC Class l/900th Entries LC Subject Heading s Title 






JA84    1. Economy    1. Political Science-Hist.-Russia    Russian Political Thought

























QA303    1. Cauchy, A.; 2. Euler    1. Calculus    Vorlesungen über Differential- und Integral-




























index entries, over and above their obvious utility in
providing access to a book's contents once the book is in
hand. This question will underlie much of the discussion
in the next two chapters.

Before turning to this question, however, it is useful to
shed some light on how the indexer controls the
multiplicities in the index, and hence the value of s and the
shape of the particular entries that will receive the highest
numbers of page locations. Obviously, this can be done
in several ways, involving such delicate questions as
how the indexer decides whether a particular word, or
sequence of words, on a particular page should
rate an entry in the index. At a simpler level, the
indexer has the opportunity to reduce multiplicities by
increasing the length of the entry. Thus in a work on
history, the indexer can either provide a single entry for
war, with a large number of multiplicities, or he can
break this same set of entries down into subsets involving
particular wars such as civil war, world war, etc.

That this mechanism is in fact used is easy to demonstrate.
Table 4.6 provides the frequency distribution, by word length,
of the 27,188 index entries in a uniform random subsample of
35 indexes in the Fondren sample. As might be expected,
the distribution can be reasonably approximated by a
lognormal distribution, as shown in Figure 4.4. The arithmetic
mean of this distribution is 3.68 words per index
entry. Only 13.5% of the entries are one-word entries.
This is somewhat larger than the 9.1% found in a smaller
sample of indexes to statistical books studied by Dolby
(7), but still provides strong support for the hypothesis
advanced in (7) that the great bulk of the entries in
back-of-the-book indexes are multi-word entries.

This observation has considerable significance for the
design of automatic indexing procedures. If one-word
entries constitute only 13.5% of the total index, it
seems unlikely that detailed frequency studies of words
will provide much insight into the problem of deriving
index entries automatically. In some of the earliest
work on this subject, Luhn (8) attempted to derive indexes
from word frequency counts, with limited success. More
recently, Damerau (9) established a procedure for deriving
coordinate index terms (to be used later via machine
searches) based on word frequency counts. Bloomfield's
(2) study of Damerau's procedure makes it clear that
coordination of the single terms derived by Damerau rarely leads
to an index entry derived by humans for the same material.
As we shall show in the next chapter, there is more to
be gained by deliberately suppressing the one-word entries
than by attempting to emphasize them.



Table 4.6

Distribution of Index Entries by Word Length:
Subsample of the Fondren Index Sample

Number       Number of    Cumulative    Cumulative
of Words     Entries      Number        Percentage

    1          3673          3673          13.51
    2          6563         10236          37.65
    3          4817         15053          55.37
    4          3905         18958          69.73
    5          2839         21797          80.17
    6          1969         23766          87.41
    7          1243         25009          91.99
    8           801         25810          94.93
    9           516         26326          96.83
   10           281         26607          97.86
  >10           581         27188         100.00
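The cumulative columns of Table 4.6 follow directly from the per-length counts. The short Python sketch below recomputes them; the variable names are illustrative only and are not part of the original study.

```python
from itertools import accumulate

# Entry counts by word length from Table 4.6; the last bucket is ">10 words".
counts = [3673, 6563, 4817, 3905, 2839, 1969, 1243, 801, 516, 281, 581]
labels = [str(n) for n in range(1, 11)] + [">10"]

cumulative = list(accumulate(counts))          # running totals
total = cumulative[-1]                         # all entries in the subsample
percent = [round(100 * c / total, 2) for c in cumulative]

for label, n, c, p in zip(labels, counts, cumulative, percent):
    print(f"{label:>3} {n:6d} {c:6d} {p:7.2f}")
```

The one-word share is the first percentage printed, and the final running total is the size of the subsample.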







[Figure 4.4. Axis labels: Percentage; Number of Words]



The observation that index entries are usually multi-word
entries also has some impact on a variety of questions
involved with the use of indexes in agglomerated form.
This will be discussed at some length in Chapter VI.



References 



1. Dolby, J. L., H. L. Resnikoff, and V. Forsyth,
   Computerized Library Catalogs: Their Growth, Cost,
   and Utility, M.I.T. Press, Cambridge, 1969.

2. Bloomfield, Masse, "Evaluation of Indexing, 3. A
   Review of Comparative Studies of Index Sets to
   Identical Citations", Special Libraries, December
   1970, 554-61.

3. Kendall, M. G., "The Bibliography of Operational
   Research", Operational Research Quarterly, 2 (1960), 31-6.

4. Thorndike, E. L., and I. Lorge, The Teacher's Word
   Book of 30,000 Words, Columbia University, New York,
   1944.

5. Kucera, H., and W. N. Francis, Computational Analysis
   of Present-Day American English, Brown University
   Press, Providence, Rhode Island, 1967.

6. Mosteller, F., and D. Wallace, "Inference in an
   Authorship Problem", Journal of the American Statistical
   Association, 58 (1963), 275-309.

7. Dolby, J. L., "The Structure of Indexing: The
   Distribution of Structure-Word-Free Back-of-the-Book Entries",
   Proceedings of the Annual Meeting of the American
   Society for Information Science, 5 (1968), 65-72.

8. Luhn, H. P., "A Statistical Approach to Mechanized
   Encoding and Searching of Literary Information", IBM
   Journal of Research and Development, 1 (1957), 309-17.

9. Damerau, F. J., "An Experiment in Automatic Indexing",
   American Documentation, 16 (1965), 283-9.




CHAPTER V

ALGORITHMIC TEXT INDEXING



An index increases access to a particular corpus of
information. Until recent times most indexes followed
the text material in certain types of books. Although
this may still be true today, the emphasis of research
into the nature of indexing has shifted to indexes
of other types of corpora, such as the permuted title
index and its variants and the citation index,
which index collections of document titles rather than the
text of the documents. Indeed, current information
retrieval efforts appear to exclude consideration of
back-of-the-book indexes. For instance, Salton (1)
offers a brief discussion of term-oriented, or derived,
indexes, of which back-of-the-book indexes are
usually instances, but the applications he describes
are to collections of document titles. The Encyclopedia
of Linguistics, Information and Control (2) mentions
only citation indexing.

This chapter is also exclusively concerned with
back-of-the-book indexes; hereafter the term index will be used
in this restricted way.

The principal result presented here is an algorithm
for the automatic construction of an index from running
text in machine readable form. A preliminary version
of the algorithm was implemented by hand and used to
derive the index to Dolby, Forsyth, and Resnikoff (3).
The version presented here has been programmed for the
IBM 360/30 using a set of assembly code macros and
tested on a set of 50 abstracts of statistical papers
published in the Annals of Mathematical Statistics
and a second set of abstracts published in Cancer Research.

The difficult question of determining what is to
constitute an adequate index for a given corpus of
running text is not considered here, although reference
is made to an earlier study (Dolby (4)) that considered
certain obvious statistical characteristics of
published indexes, as well as to the previous chapter.

The cost of deriving the index entries and formatting
them into standard format is approximately 2¢ per line
of input text, based on standard commercial rates (west
coast of the United States).






Let us assume that an index is an ordered collection
of word sequences (or transformations thereof) from
the running text together with appropriate locator
designations (e.g., page numbers). A reasonable first
step in deriving such an index is to partition the
text into a set of word sequences using, in this case,
marks of punctuation and structure words to determine
the sequence boundaries.

Each sequence is then examined to determine whether
it should be deleted from the set. In particular,
sequences consisting of structure words only are deleted.
For reasons that will become evident later, sequences
consisting of single words and sequences that occur
only once in the entire corpus are also deleted from
the set.

Of the various possible transformations, it is obviously
desirable to identify singular and plural forms,
to invert certain word sequences (at least selectively)
so as to provide access to words occurring only at the
end of the word sequences, and to superimpose a "see"
and "see-also" facility to permit more complex transformations.

Implementation of such an algorithm requires repeated
access to various lists of words and morphemes. Computer
time will obviously be strongly influenced by the
strategies employed to accomplish these comparisons.
To cite the most obvious example, it is clearly more
efficient to store the list of structure words (which
is relatively small but contains many words of high
occurrence frequency) rather than the list of content
words, which has the converse properties.

Where possible, significant gains can be made by
testing for word classes rather than for individual
words. Thus, it is useful to identify all participial
forms, as these do not generally appear as index entries.
On the other hand, provision must be made to allow the
override of such rules for cases of particular importance
(e.g., stratified sampling is an important statistical
entry that should not be suppressed).

As the function of these various lists is primarily
to delete words from the index, it is convenient
to refer to the lists as "stop" lists and the sets of
override words as "go" lists. Although sufficient testing
on a wide variety of subject matter is not yet available,
it would appear that the stop lists are basically
independent of subject material and the go lists are subject
dependent. Thus a careful study of available authority
lists in the subject field would be necessary to ensure
proper operation of the algorithm. (Such a study would
be necessary in any event to prepare the "see" and
"see-also" entries.)

Figure 5.1

Word Frequency versus Rank
Brown University Standard Corpus of American English

Preliminary segment boundaries are established by
marks of punctuation (other than the hyphen and apostrophe).
Within the segments thus established, further boundaries
are introduced between sequences of consecutive stop
words (see Table 5.1) and non-stop words. As a simple
expedient, all words in the stop list ending in s
have the s removed, and the match between the current word
and the stop list is made after the final s (if any) has
been removed from the current word. More sophisticated
"plural logic" would be justified here only if the
stop list were expanded substantially and in its
expanded form contained a significantly larger number
of "irregular" plurals.
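The two boundary rules just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the miniature stop list STOP and the function names are hypothetical, and the program described in this chapter was written in IBM 360 assembly code, not Python.

```python
import re

# Hypothetical miniature stop list, already stripped of terminal s.
STOP = {"the", "and", "to", "by", "with", "for", "are", "thi", "that"}

def normalize(word):
    """Lower-case a word and remove one terminal s, if any."""
    w = word.lower()
    return w[:-1] if w.endswith("s") else w

def is_stop(word):
    w = normalize(word)
    return len(w) <= 1 or w in STOP      # one-letter words are always stopped

def segments(text):
    """Return the runs of consecutive non-stop words, in text order."""
    found = []
    # Punctuation other than the hyphen and apostrophe closes a segment.
    for chunk in re.split(r"[^\w\s'-]+", text):
        run = []
        for word in chunk.split():
            if is_stop(word):
                if run:
                    found.append(run)
                run = []
            else:
                run.append(word)
        if run:
            found.append(run)
    return found
```

Note that "is" is stopped here without appearing in the list: removing the terminal s leaves a one-letter word.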

The selection of the words to be used in the stop
list provides an intriguing problem. Clearly, all
structure words (neglecting archaic forms) should be
included. It turns out also to be useful to include
high frequency adjectives and verbs. It is therefore
tempting simply to select the first n words from a
rank ordered word frequency list. Unfortunately,
there is no clear break in such a list in the vicinity
of a reasonable cutoff (see Figure 5.1). Thus the
cutoff must be made simply in terms of finding a
reasonable trade-off between added machine time in
testing against large lists, and added editing costs
at the other end due to failure to suppress words.
Based on the developments of Chapter II, we would
expect the cutoff to be in the order of 1/30 of the
vocabulary. The word list used here has been purposely
kept short during programming and should probably be
expanded by a factor of two or three in actual use.
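The "tempting" rank-order cutoff can be written down directly, which makes it easy to experiment with different fractions of the vocabulary. The sketch below is an assumption-laden illustration: the function name and the default fraction argument are hypothetical, and the stop list of Table 5.1 was not produced this way.

```python
from collections import Counter

def frequency_cutoff(tokens, fraction=1 / 30):
    """Return the most frequent words, up to the given fraction of the vocabulary."""
    counts = Counter(tokens)
    n = max(1, round(len(counts) * fraction))   # size of the candidate stop list
    return [word for word, _ in counts.most_common(n)]
```

With a vocabulary of 60 distinct words, for example, the default fraction keeps the two most frequent words as stop-list candidates.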

The list organization as presently implemented is
also quite simple: as the word length (in characters)
of the current word is known at the time of the match,
the list is broken down by word length and arranged
alphabetically within the sets of each length. Matching
is done sequentially, with termination on a match or
when the current word sorts lower than the list entry
being compared. Expansion of the lists would probably
make it useful to use a hashing technique.
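The list organization described above amounts to a table keyed by word length, each slot holding an alphabetized list that is scanned until a match or until the scan passes the place where the word would be. A minimal Python sketch follows; the names build_table and is_stopped are illustrative and not taken from the program.

```python
from collections import defaultdict

def build_table(stop_words):
    """Group stop words by length; keep each group in alphabetical order."""
    table = defaultdict(list)
    for word in stop_words:
        table[len(word)].append(word)
    return {length: sorted(words) for length, words in table.items()}

def is_stopped(word, table):
    """Sequential scan of the list for this word length, stopping early."""
    for entry in table.get(len(word), []):
        if entry == word:
            return True
        if entry > word:        # passed the place where the word would be
            return False
    return False
```

In a language with built-in hashed sets, the hashing alternative mentioned in the text reduces to a single membership test, at the cost of giving up the ordered scan.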

TABLE 5.1

SHORT LIST OF STOP WORDS ARRANGED BY WORD LENGTH

 2 letters:  an, at, be, by, do, go, ha, he, hi, if, it, on, or,
             me, my, no, so, to, up, wa, we

 3 letters:  all, and, any, are, but, can, did, due, few, for,
             get, had, her, him, how, let, may, new, not, now,
             off, old, one, our, out, own, put, see, she, the,
             thi, too, two, way, who, you

 4 letters:  also, back, been, both, come, down, each, even,
             from, give, good, have, here, hold, into, just,
             know, last, lead, lend, like, long, made, make,
             many, more, most, much, must, only, over, part,
             said, same, show, some, such, take, tend, term,
             than, that, them, then, upon, very, well, were,
             what, when, will, with, work, your

 5 letters:  about, above, admit, after, among, begin, could,
             drawn, first, found, given, great, imply, known,
             might, never, other, refer, right, sense, shown,
             since, still, their, there, these, those, three,
             under, until, usual, where, which, while, whose,
             wider, would, yield

 6 letters:  become, before, behave, better, cannot, change,
             chosen, denote, depend, derive, discus, either,
             extend, higher, implie, little, permit, relate,
             reduce, result, second, should, unique, variou,
             wherea, within

 7 letters:  against, another, because, between, certain,
             consist, earlier, further, general, improve,
             include, instead, operate, present, previou,
             provide, require, several, similar, special,
             through, unknown, without

 8 letters:  consider, original, possible, satisfie, together

 9 letters:  arbitrary, different, excellent, important,
             otherwise

10 letters:  additional, elementary, particular

N.B. All one-letter words are stopped. Terminal s is removed;
thus ha stops has.

The next segmentation stage consists of segmenting the
sequences of non-stop words into consecutive sequences
of words ending in ed, ly, ing, or ful and sequences of
words not ending in any of those four suffixes. The
current go list to override this segmentation consists
of only three words (family, stratified, and sampling)
and is included only to ensure that the facility
exists in the program.
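The suffix stage can be sketched as a grouping of each run of non-stop words by a single yes-or-no test. The Python below assumes the four suffixes are ed, ly, ing, and ful (as listed in the final segmentation step) and uses the three-word go list quoted in the text; the function names are illustrative.

```python
from itertools import groupby

SUFFIXES = ("ed", "ly", "ing", "ful")
GO = {"family", "stratified", "sampling"}    # override list quoted in the text

def has_suffix(word):
    """True if the word ends in one of the suffixes and is not on the go list."""
    w = word.lower()
    return w not in GO and w.endswith(SUFFIXES)

def split_on_suffix(run):
    """Split a run into maximal sub-runs that agree on has_suffix."""
    return [(flag, list(words)) for flag, words in groupby(run, key=has_suffix)]
```

A purely orthographic test of this kind will also catch short words such as "bed"; a larger go list is the remedy the text has in mind for such cases.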

The structure words of and in are not included in
the main stop list so as to allow sequences such as
analysis of variance and convergence in measure to
emerge as index sequences. However, it is clear that
primary index entries do not include entries beginning
or ending with of or in. Hence the final segmentation
step is to segment beginning or ending occurrences
of these two words from the non-stop, non-(ed, ly,
ing, ful) word sequences.
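The final segmentation step is a trim at both ends of each surviving sequence. A minimal Python sketch, with a hypothetical function name:

```python
LINKS = ("of", "in")

def trim_links(run):
    """Remove leading and trailing occurrences of 'of' and 'in' from a sequence."""
    words = list(run)
    while words and words[0].lower() in LINKS:
        words.pop(0)
    while words and words[-1].lower() in LINKS:
        words.pop()
    return words
```

Interior occurrences are left alone, so analysis of variance survives intact.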

Following a suggestion of John Tukey, we have investigated
the utility of "stopping" all short words, i.e.,
words with fewer than n characters. Such a procedure
would clearly speed up the program and set aside the
difficulty of running down a number of short words
that occur with sufficient frequency to be included
in a reasonable system (such as those occurring in
Latin phrases). Based on present experience, it appears
that suppressing words with fewer than four characters
is reasonable. This procedure has been used in the
experimental run on the 50 abstracts from Cancer Research,
but not on the two earlier examples presented here.

All segments other than those consisting wholly of
non-stop, non-(ed, ly, ing, ful) words, with beginning and
ending of and in removed, are deleted. Of the segments
remaining, all segments consisting of single words are
also deleted. Experimentation with this step in the
procedure stems from an observation made in Dolby (4)
that one-word entries in published indexes occur with
surprisingly low frequency. Hence, the obvious
strategy is to suppress all such entries, with exceptions,
rather than to pass all, with exceptions.

The override to single-word suppression can take
several forms. First, a go list can be appended (though
none is used in the present implementation). Independence
would be an obvious choice for statistical subject
matter. Second, proper names, that is, words in all
caps or initial caps, could be used as an override.
(This was done in the manually implemented version used
on Ref. (3) but has not been exercised in the machine
implementation.) Finally, single-word primary entries
can, and do, occur in the inverted entries studied below.
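The first two overrides can be combined into one small predicate. The sketch below is illustrative only: the function name and the go argument are assumptions, and the capitalization test is a crude stand-in for the recognition of proper names.

```python
def keep_entry(words, go=frozenset()):
    """Decide whether a candidate entry survives single-word suppression."""
    if len(words) > 1:
        return True                      # multi-word entries always pass
    if not words:
        return False
    word = words[0]
    # Overrides: the word is on the go list, or it begins with a capital
    # letter (which covers both initial caps and all caps).
    return word.lower() in go or word[0].isupper()
```

For statistical text one might call it with go={"independence"}, following the example given above.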

This reduced list of segments, or possible index
entries, must now be transformed in certain obvious
ways both to achieve proper compression in the final
index and to provide at least the appearance of a
manually prepared index. One obvious consideration
involves the problem of identifying singular and plural
forms. Again, a relatively simple strategy is sufficient
to take care of most of the problem. Plural forms are
rarely used as modifiers and when so used are used with
a high degree of consistency. Thus if least squares
method occurs, it is highly unlikely that least square
method will also occur (though least squares methods
might well occur). Hence it is only necessary to
prepare for plurals that occur at the end of the entry.

The most frequently occurring plural form is obtained
by adding s to the singular form. If the final s is
replaced by a code that will sort immediately after
blank (but prior to a), it is possible to compare
successive entries after sorting and to eliminate the
final s from all entries that follow entries that
are otherwise identical. The final s is then restored
in all other cases. In the application to the statistical
abstracts, 311 of the 946 entries ended in s. Of
these, 41 were stripped of the final s to provide
the required identification. More sophisticated logic
of the same variety could be added to handle plural
forms such as processes, densities, and matrices, although
a quick survey of the 946 entries disclosed only four
such occurrences where identification was desirable.
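The sort-and-compare device for plurals can be sketched as follows. The marker character "!" is an assumption chosen because it sorts immediately after the blank and before every letter in ASCII; the code actually used on the IBM 360 is not stated in the text, and the function name is illustrative.

```python
MARK = "!"    # assumed code: sorts just after blank, before any letter

def merge_plurals(entries):
    """Fold a final-s plural into its singular when the two sort together."""
    coded = sorted(e[:-1] + MARK if e.endswith("s") else e for e in entries)
    merged = []
    for entry in coded:
        if not entry.endswith(MARK):
            merged.append(entry)                 # entry did not end in s
            continue
        stem = entry[:-1]
        if merged and merged[-1] == stem:
            merged.append(stem)                  # follows its singular: drop the s
        else:
            merged.append(stem + "s")            # no singular precedes: restore the s
    return merged
```

As in the text, words such as analysis have their final s restored because no otherwise identical entry precedes them.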

Another purely manipulative step that must be introduced
at this stage is the generation of inverted entries
to provide access to words occurring at the end of the
text-ordered entries. There appear to be two main
forms of interest. The first, typified by analysis
of variance, can be implemented by the obvious algorithm
that produces variance, analysis of. A more sophisticated
form could be used to suppress one or the other of the
two variants. A pair of relatively short, subject
dependent stop lists would probably suffice for this
purpose.
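The first form of inversion is a simple rewrite around the linking word. The Python sketch below handles of and, as an assumption not confirmed by the text, applies the same rule to in; the function name is illustrative.

```python
import re

def invert_link(entry):
    """'analysis of variance' -> 'variance, analysis of'; None if no link word."""
    match = re.fullmatch(r"(.+) (of|in) (.+)", entry)
    if match is None:
        return None
    head, link, tail = match.groups()
    return f"{tail}, {head} {link}"
```

Because the first group is greedy, an entry with several link words is split at the last one.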

A second type of inversion, typified by mapping normal
distribution into distribution, normal, could be
implemented either by a go list of modest proportions or by
ordering the entire set of entries by last word and
then inverting all sets involving a common last word
of sufficiently high frequency. Neither of these
alternatives has been tried at this time, though some statistics
have been gathered on the behavior of statistical terms
from this point of view.
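The second alternative, which the text notes has not been tried, might look like the following Python sketch. The threshold min_count and the function name are hypothetical; what counts as "sufficiently high frequency" is left open in the text.

```python
from collections import Counter

def invert_by_last_word(entries, min_count=2):
    """Invert entries whose last word is shared by at least min_count entries."""
    last_words = Counter(entry.split()[-1] for entry in entries)
    inverted = []
    for entry in entries:
        *front, tail = entry.split()
        if front and last_words[tail] >= min_count:
            inverted.append(f"{tail}, {' '.join(front)}")
    return sorted(inverted)
```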

In addition to the deletion of one-word entries it
is evident, when one operates on full text, that it is
entirely safe, and indeed quite useful, to delete
entries that occur only once in the text. Intuitively,
one can argue that if a term is not mentioned at
least twice (allowing for plural variants and the like)
then there is little likelihood that enough information
is presented about that entry to make it worthwhile
as an entry in the final index. Practically, an
examination of singly occurring entries in the samples
we have studied thus far makes it clear that this is
a highly useful device for eliminating much of the
"noise" that inevitably is present when one takes such
a simple view of English syntax. Statistically, the
step can be justified on the grounds that the resultant
index is of the proper size (as a percent of the volume
of the book indexed) when such entries are left out,
but noticeably too large if they are left in.
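Deleting singly occurring entries is a matter of collecting locators per entry and keeping those with more than one. The Python sketch below assumes the input is a sequence of (entry, locator) pairs, an illustrative data layout that is not described in the text.

```python
from collections import defaultdict

def drop_single_occurrences(occurrences):
    """Keep only entries that refer to more than one location in the text."""
    locations = defaultdict(set)
    for entry, locator in occurrences:
        locations[entry].add(locator)
    return {entry: sorted(locs) for entry, locs in locations.items() if len(locs) > 1}
```

Under this reading, an entry that appears twice at the same locator still counts as a single location.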

The use of this device must be tempered by knowledge
of the text. For instance, this device was not used
in the index to the statistical abstracts, as it was
evident that the abstracts did not possess sufficient
redundancy to allow proper operation of such a mechanism
within an abstract, and it seemed unwise to base the
use of such a mechanism on a (not necessarily homogeneous)
set of abstracts. Presumably there are certain books whose
text has a very low redundancy; for these this type of
deletion should not be implemented.

The manual implementation of the algorithm on book
length material (reference (3)) is shown in Figure 5.2.
Two systematic departures from the general algorithm
were made in implementing it: first, names of States
were systematically deleted from the index; second,
a list of special words for inclusion in the index was
used, containing names of countries and languages.
Both decisions ensure uniformity of inclusion or exclusion
of terms in each class without regard to the relative
significance of each usage. Finally, as described
in the Instructions for Use of the Index, two index
terms were manually inserted: the collective Computer
Languages, and the alternative World War I for the
algorithmically occurring First World War.

Perhaps of greater theoretical interest than those terms 
that appear in the index in Figure 5.2 are those terms 
that were deleted by the requirement that each entry that 
appears in the index, except for entries having special 
format properties, refer to more than one location in 
the text. Table 5.2 lists those word sequences which
were excluded from the index for this reason. Preceding
some of the words are letters which describe properties
of the word sequence: *p* indicates that the sequence
is a plural form of another word sequence selected by
previous steps of the algorithm; the plural sequence
is therefore equivalent to the singular one, and hence
in the final index. Sequences preceded by *i*
appeared in italic type font. It appears that this font





Index 



Instructions for Use of the Index

The index is the result of applying an algorithm to the text of the book;
a minimal amount of (probably mechanizable) subjective human post-editing
in the final two steps produced the amalgamated and reordered
form that is printed below.

All word sequences that are not printed in italics appear in the given
form in the text of the book, apart from possible differences of
capitalization. Terms that do not explicitly appear in the text do not
appear as index terms with the exception of the collective Computer
Languages, and the alternative World War I for the naturally occurring
entry First World War.

Those readers who are experts in information retrieval and automatic 
indexing may be interested to know that this is a 4 percent index. 



access-per-item, 15, 16
access files, 16
access points, 18, 25, 154
accession distribution, 103, 104
accession number, 139, 141
accession year, 102, 103
accessions growth, 98, 102
acquisition expenditures, 8, 9
acquisition growth, 8, 103; see also
  accessions growth
AID, 119; see also Agency for
  International Development
algebraic notation, 57
Agency for International Development, 119
Algeria, 120
ALGOL, 27, 54; see also Computer
  Languages
Alphatype, 64
Alphavers Book Condensed, 65
ALTEXT, 57, 58, 59; see also
  Computer Languages
American Civil War, 5, 104, 112
archival collection, 17, 39, 95
archival component, 12
archival libraries, 16, 18, 122
Argentina, 122
Asia, 128
Author, 146
authority list, 36, 81

Baltimore County Library, 156
Belgium, 119
Bell Telephone Laboratories, 55
bibliographic description, 75, 137,
  139, 144, 153
bibliographic files, 18, 136, 137
bibliographic material, 71, 150
bibliographic record, 16, 18, 71,
  73, 74, 81, 82, 83, 149









Figure 5.2 

ALGORITHMIC INDEX TO REFERENCE (3) 




[Figure 5.2, continued: intermediate pages of the algorithmic index; not legible in this copy]



164 Index 



SUL, 41, 43; see Stanford Undergraduate
  Library, Meyer Undergraduate Library
Switzerland, 119, 120

telephone directory, 68, 69, 155
TLMAC, 28, 56, 57; see also Computer
  Languages
time intervals, 8, 98
title field, 141, 143, 152
title information, 74, 139
title list, 2, 153, 154
titles per subject, 147, 14£ 
trend line, 12
Type Face Design, 61
type face, 61, 62, 64, 65, 68, 70

UC/B, 40, 43; see University of
  California/Berkeley
United Kingdom, 127
United States, 98, 100, 103, 112,
  122, 123, 124, 127
United States GNP, 124
United States Gross National Product, 103
university collection, 8, 130
university library, 5, 12, 24, 38, 51,
  92, 130, 133, 137, 140, 156
University of California/Berkeley,
  40; see UC/B
University of Chicago, Graduate
  Library School, 53
University of Newcastle-upon-Tyne
  Library, 138; see Newcastle
Uruguay, 123
U.N., 119
U.N. Yearbook, 119
U.N. Yearbook of National Account
  Statistics, 119
U.S. Department of State, 119

Versatile Bold, 65
volumes per card, 88, 94
Volumes per Title, 94

Wall Street Journal, 31
Wang algorithm, 56
West Germany, 127
Widener Library, 2, 155, 156; see
  also Harvard
Widener Shelf List, 75, 148, 155
World Almanacs, 2
World War I, see First World War,
  Great War
World War II, 5, 100; see Second
  World War
World Wars, 5, 124

XPOP, 57, 58, 59; see also Computer
  Languages







TABLE 5.2 



Excluded Index Terms Referring to One



P 



abnormal parenthesization
absolute frequencies 
academic staff 
access capability 
access point 
access system 
accessible estimates 
accession distributions 
accessions growth rate 
acquisition rates 
acquisitions data
acquisition mechanisms 
acquisition process 
acquisition rate 
acquisition schedule 
acquisition shares 
acquisition structure 
acquisitions - GNP relation
acquisitions - GNP share
acquisitions budget
acquisitions growth
acquisitions expenditures
adequate user access

Baltimore County Libraries
bedroom states
biases inherent
bibliographic descriptions
bibliographic holdings
bibliographic indications
bibliographic items
bibliographic listings
bibliographic lists
bibliographic notes



equal itybook 
Book 



algebraic equations 
algebraic expressions
alpha-numeric code 
alphabetic code 
approximate linearity 
approximate normality 
archival collections 
archival holdings 
archival libraries 
archival records 

Assembly code programming
"assembly languages"
assignment procedure
author access 
author field 
author list 
author name 
author/title list 
authority lists 
automated catalog 
Auxiliary memory 
average cost 
average growth 
average number 
average record length 
average time 



bibliographic practice
bibliographic records 
bibliographic references 
bibliographical information 
bibliographically incomplete 
bibliography 
Bibliography Field 
bibliography section 

publication depressions 
Length 
bookseller 
budget dollar 
budgetary requirements 

business community 
calculus text
call number

(Canadian) census figures
capital letters
capitalization conventions
capitalization errors
capitalization requirements 
card catalog collection 
card collection 
card files 

card space convention 
card system 
cards per entry 
cards per title 
"careful" study 
case alphabet 

catalog card conversion errors 

catalog cards 

catalog data 

catalog files 

catalog interrogations 

catalog preparation 

catalog productions 






catalog records 
catalog trays 
cataloging function 
character density per page



p cost figures 
cost function 
cost increments 
cost item 



character density per square inch
cost levels

character manipulation
Circulation
circulation file 
circulation rates 
civilization’s material aspects 

class category 
class fields 

class order p 

"Collected Works" 

Collection Breakdown 
collection sizes 
collection subset 
common machine 

component national growth rates 
composite costs i 

composite estimate columns i 

composition costs 
composition devices 
composition practice 
computational ease 
computational facilities 
computational linguistics 



cost per title 

cost point 

cost structure 

cost picture 

cost study 

cost variations 

costs per thousand dollars 

county library automation projects 

County library systems

county school system

county system

data conversion

data files

data object structure

data objects

date of access field

date of order field 

decimal classification system 

density output 

detail level 

dictionary lookup procedures 

document descriptions 

document identification procedures 



X Computer Line Printer Type Faces
dollar equivalents



computer programs 
computerization costs 
Condensed 
consecutive years 
contents 

context permissible 
conversion expense 
conversion problem 
conversion procedure 
conversion process 
conversion task 
copy output cards 
core memory 

correction capabilities 
correction costs 
correspondence files 
cost 

cost area 
cost breakdowns 
cost equations 
cost estimates 
cost factors 



dummy entities 
dynamic aspects 

economic analysis

economic aspects 

economic data 

economic depression 

economic disintegration 

economic references 

economic size 

economic state 

economic statistical data 

economic strength 

economic units 

edition statement 

educational advantages 

electronic photocomposition 

electronic typesetting devices 

elementary calculus

English-language sentences 

English-speaking 

error-checks 






error-correction capability
European

executable statements
expansion ratio
"explosive" growth
exponential curve
exponential expansion
exponential function
exponential imprint date

exponential library growth rates

exponential rate

faculty library committee

feedback response p

field names

fields

fields per record p

file figures

file maintenance

file records

file structure

file system

financial data

financial community

financial transactions

(first) generation machines

fixed absolute growth p 

floating point arithmetic 

follow-up correspondence

foreign language acquisitions p

foreign language documents

foreign titles

Format-Dependent Errors

format capabilities

format compromise

Format control

format elements

format requirements

French-African

French-speaking 

functional collection 

fund name 

fundamental processes affecting 

fundamental structure 

future funding needs 

geographic area 

geometric decrease 

global category 

global check 

global war 

GNP-acquisitions relation



GNP at Market Prices 
graph paper 
graphic arts 
graphic representation 
Gross Domestic Product 
p gross national products 
Gross Personal Income 
ground-level extension
distribution growth challenge 



growth periods 
growth phenomenon 
growth problems 
growth rates 
growth statistics 
hardware costs 
Harvard samples
"higher level" languages
historical events
historical significance
human costs

human readable document
identification number
illegitimate code
implementation cost
imprint data
imprint dates
imprint date growth
imprint decade
imprint distributions
in-depth studies
in-school access
income data
income growth
income ratios
indented lines
information base 

fields 



information 
information per inch 
information per page 
information run-over
input costs

libraries input errors
input format
input methods
input, program
inquiries per record
instruction per second
intercolumn space
interentry blank lines
interlibrary loan service






i 

P 



P 



P 



P 



P 



P 

P 



interlibrary loans
interword spaces
interpretive approach
item

item fields
item purposes

items per year p

journal-to-language assignment p
journal titles
key economic issue
key information
key library personnel
key words p

keyboard conventions
keyboard operators
keypunch equipment 
labor categories 
language acquisitions
language count
language expertise
Language Field
language groups
Language information
linguistic algorithm
language shares
Latin American
library activities
library applications
library card catalog conversion
library card catalogs
library catalog card contents p
library catalog operation
library catalogs
library characteristics
library collections p
library community
library context
library cost structure
library expenditures
library explosion
library facilities
library file operations
library files
library holdings p
Library Management Tool
library market
library materials
library mechanization
Library of Congress acquisition



Library of Congress acquisition shares 
Library of Congress classification 
system 

Library of Congress nonserial 
acquisitions 

Library of Congress size distribution 
library operations 
rocess library personnel 
library procedures 
library services 
library shelf lists 
library structure 
library systems 
library like activities 
line printers 
linear string 
lines per second 
linguistic biases 
linguistic constructions 
linguistic data-objects
linguistic exploitation
linguistic records
linguistic partitions
linguistic subpopulation shares
list St rue
literate population
load requirements

location information

log graph paper
logarithmic graph paper
logarithmic scales
Logical operations
loss rate

"lower level" languages
machine-readable catalogs
machine-readable library catalogs
machine-readable materials
machine-readable subject authority
lists



machine change
machine design
machine elements
machine inquiries
machine languages
machine language instructions
machine methods
machine output
machine rules
data
machine time usage







machine use 

magnetic cores 

magnetic discs 

magnetic drums 

magnetic tapes 

main file 

management tool 

manipulative operations 

manpower costs

manual generation 

manual operations 

manual strategies 

manuscript form 

map classification category 

marginal improvements 

mathematical computation 

mathematical exercise 

Mathematical Journal Titles 

mathematics

mathematics faculty committee
mathematics journals
mean growth
mechanical errors
mechanical translation
mechanization context
methodological principle
Misspelled words 
model cost equations 
monetary inflation 
monograph collection 
monographic letter frequencies 
month portion 



multi-language manipulation procedures order 
multicharacter vowel string ordinal numbers 
multiple copy graphic arts quality author list 



nonoriental monograph acquisitions 
nonpamphlet items 
nonserial Fondren sample 
nonserial shelf list cards 
nonserial textual works 
nonstationary growth periods 
nonstationary intervals of 
library growth 
nonstationary time series
"normal probability paper"
normal distribution
normal probability distribution
normative measures
number field
numeric symbols
numerical computation
off-site areas
on-line input
open-stack libraries
optical character recognition
equipment

optional parameters
order-of-magnitude changes
order date
order file records
(order of) magnitude cost reductions
(order of) magnitude cost variations
(order of) magnitude decisions
(order of) magnitude gains
order operation
order system
order system file
system reports



musical scores
national accounts statistics
national economic growth
national economy
national origins
national publications
national publishers
national statistical data
natural languages
out-of-date catalog



(new) acquisitions information 
non-English words 
non-numerical procedures 
nonlibrary customers 
nonlinear scales 



output error signals
output list
output machines
output printers
output sheet 
page counts 
page design 

collection processes paper costs 

(paper) tape input 
parallel search logic 
pattern-matching facilities 
pattern-valued functions 
pattern primitives 
per capita growth 






per unit basis 
percentage growth 
personal author 
personal authorship 
personal incomes 
photo-offset reproduction 



(physical) volumes per serial title 



random samples 

record entry 

recursive processes 

refugee movements 

relative frequencies 

relative frequency distribution



pilot study 
plant expansion
political disintegration
political issues
population growth
potential control
print runs
printed copies
(printing and) binding costs
printing cost
Printing Type Faces
private endowment funds
Private Finance
probability scale graph



relative merit 



relative performance 
relative significance 
relative size 
reliable data 
rental figure 
report system 
X reprogramming costs 
research effort 
research grants 
research purposes 
retrieval processes 
retrieval requests
p retrospective files
p retrospective materials
run costs 



process flow 

X processing bibliographic records
X processing linguistic information
salary structure
sample cards



production cost 
production economies 
production processes 
productivity per dollar 
profound machine language level 
program errors 
program routines 
proper-name entries 
Proper names 

proper scale compression 
"propositional calculus"
public acceptance
public card catalog
public catalog losses
public libraries
public sales
public use
publication cost
publication field
publication growth 
publishing industries 
punch paper tape 
quality performance figures 
quality point 

quantal jump characteristics 
quantal jumps 
random access 



school cooperation 
scientific effort
scientific machines
scientific periodical literature
study scientific publication 

scientific research 
selection criteria 
selection operation 
selection procedure 
selection processes 
selection technique 
semibold type faces 
semilogarithmic paper
p serial publications 
serials shelf list 
service bureaus 
set theoretic operations 
share distribution 
shelf list circulation file 
Shelf List Statistics 
shelf space 

significant acquisitions - GNP 
disagreement 
social dislocation 
social ideologies 
social phenomena 
social systems 






e 



X 

X 



i 



X 

X 

X 

X 



p 



sort operation 
source statement language 
source statement structure 
special purpose bibliographies 
square inch 

staff expansion requests 
standard algebraic form 
standard algebraic notation 
standard precedence conventions 
stationary growth rate 
statistical correlation theory
statistical distribution
statistical distributions
statistical ensemble
statistical indicator
statistical relationship
statistical summary
statistical uniformities
status indicators
storage media
storage space 
string contents 
string processor 
structure 

subject-oriented bibliographies

Subject-Title catalog

subject area 

subject areas 

subject bibliographies 

subject bibliography 

subject catalog volume 

subject classes 

subject coverage 

subject definition 

subject designation 

subject material 

subject matter 

subject volumes 

subject words 

suburban population 

summary information 

supervision costs 

symbol strings 

Symbolic Expressions 

systematic way 

tape costs 

technological advances 
telephone companies 
telephone directories 
text samples 



textual works 

"third generation" computers
time advantage
(time and) motion studies
time benefits
time constraints
Time Field
time information
time scale
time schedule
time variation
title-word access
title-word information
X title card 
X Title cards 
title indices 
title languages 
transcription errors 
transliteration schemes 
tray contents 
trend curve 
turn-around times 
type face catalog 
type face size 
p type faces 
type fonts 
type size 
type styles 
undergraduate library 
undergraduate student 
Union catalogs 
X unit cost 
X unit costs 
p university libraries 

university library book catalog 

university library systems 

university order staff 

university rate 

usage rates 

user-library complex

"user codes" 

User cost 

(user) cost factors 
i utility 

utilization costs 
vertical scale 
e XYZ Library

yield rate per item 








does not characterize indexible sequences. *x* indicates
that a manual error has been made; in some cases a verb
gerund has not been deleted in the stop list step of
the algorithm, so a sequence appears in the later stages
of the algorithm when it ought to have been deleted at the
first stage. For example, the sequence processing
bibliographic records contains the structural stop
sequence -ing indicating the gerund form; exclusion of this
word at the stop list stage would have left the subsequence
bibliographic records for consideration, which appears in the
index anyway because it occurred in more than one additional
location. The indicator *e* means that the sequence has been
excluded by the human posteditor. Two such sequences are
noted: square inch, which should perhaps inhabit the
stop list, and XYZ Library, which must be considered because
one of the special format inclusion conditions is that
sequences containing all capitalized words are indexed
regardless of the number of text locations to which they
refer; but this instance doesn't supply any useful information.
It is a stylistic curiosity. Finally, certain sequences
in the table are preceded by a parenthesized word. For
instance, (first) generation machines appears. The algorithm
generated generation machines; the preceding text
word was included in the list to help the reader to understand
the context of the sequence, which, following the
algorithm, was excluded from the index.
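
The two rules discussed above (removal of gerunds at the stop list stage, and the special format condition for all-capitalized sequences) can be sketched as follows. This is an illustrative reading only: the function names are hypothetical, and testing for a final "ing" is a crude stand-in for the structural stop sequence -ing, since it would also catch nouns such as "string".

```python
from typing import List


def strip_gerunds(sequence: str) -> List[str]:
    """Break a candidate sequence at words ending in '-ing'; return what is left."""
    pieces, current = [], []
    for word in sequence.split():
        if word.lower().endswith("ing"):
            # The gerund acts as a stop word: close the current subsequence.
            if current:
                pieces.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        pieces.append(" ".join(current))
    return pieces


def keep_entry(sequence: str, locations: int) -> bool:
    """Index a sequence if every word is capitalized or it has more than one location."""
    all_capitalized = all(word[0].isupper() for word in sequence.split())
    return all_capitalized or locations > 1


print(strip_gerunds("processing bibliographic records"))  # ['bibliographic records']
print(keep_entry("XYZ Library", 1))   # True: special format condition applies
print(keep_entry("square inch", 1))   # False: single location, not capitalized
```

Under these rules XYZ Library survives mechanically, which is why its exclusion is left to the human posteditor.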

Quantitatively, this algorithmic index is not significantly 
different from the manually produced indexes analyzed in 
Chapter IV. The gross size of the index is 5 pages as 
compared to 157 pages of text, a text to index ratio of 
31.4 to 1. The index entry length distribution is given 
in Table 5.3. 



Table 5.3

Index Entry Length Distribution
Computerized Library Catalogs

Number                     Cumulative
of Words     Frequency     Frequency

   1            82             82
   2           190            272
   3            44            316
   4            13            329
   5             6            335
   6             4            339
   7             1            340







The percentage of one-word entries (24%) is higher than the 
average number of one-word entries in the subsample from 
the Fondren Index Sample (13%) . Although this is not a 
significant deviation (more than 17% of the indexes in the 
subsample had more than 24% one-word entries) it is worthy 
of some comment: the basic algorithm suppresses one-word
entries, with exceptions. In this case the exception rule
was to include capitalized one-word entries. Thus, even
though the algorithm is designed to operate against one-word 
entries, the proportion occurring is still on the high side. 

The distribution of entries by number of words is shown in 
Figure 5.3. The distribution is reasonably approximated by 
the lognormal distribution. The arithmetic mean of the 
distribution is 2.08 words per entry, compared to 3.68 
words per entry for the subsample as a whole. Although there 
is again some cause to question whether this is a significant
deviation, there is an underlying weakness in the form
of the algorithm as it was used in this example. The algorithm
excludes entries of the form X of Y. In (4) the structure-word-free
entries were found to have a mean number of words
per entry of 2.12, almost exactly the average found for this 
algorithmic index. However, the structure-word-free 
entries of (4) made up only 55% of the total number of entries. 
In Chapter VI we shall return to this question in analyzing 
the output of the basic algorithm where the capability to 
generate entries of the form X of Y has been included. 
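
The summary figures quoted for this index can be recomputed directly from Table 5.3. The sketch below is only an arithmetic check; the variable names are illustrative, and the page counts (5 index pages, 157 text pages) are those stated in the text.

```python
from itertools import accumulate

# Table 5.3: number of words per entry -> number of entries.
frequency = {1: 82, 2: 190, 3: 44, 4: 13, 5: 6, 6: 4, 7: 1}

total = sum(frequency.values())
cumulative = list(accumulate(frequency.values()))
one_word_share = frequency[1] / total
mean_words = sum(k * f for k, f in frequency.items()) / total
text_to_index = 157 / 5  # pages of text per page of index

print(cumulative)  # [82, 272, 316, 329, 335, 339, 340]
print(f"{one_word_share:.0%}, {mean_words:.2f}, {text_to_index:.1f}")
# 24%, 2.08, 31.4
```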

The absence of structure-word entries also tends to depress 
the overall size of the index. Although the bulk size, 
measured in pages, is approximately 1/30th of the text size
(as would be expected), the ratio of bulk of the index to
bulk of the text measured in number of characters is
approximately half of this figure. (Not only are the index
entries somewhat shorter than would be found in the manual 
indexes, the text density is approximately 3,150 characters 
per page as compared to the mean of 2,400 characters per page.) 

The lack of structure-word entries also tends to distort 
the page location distribution, (Table 5.4). 






Table 5.4

INDEX PAGE LOCATION DISTRIBUTION
COMPUTERIZED LIBRARY CATALOGS

Number of
Page Locations                  Cumulative
Per Entry         Frequency     Frequency

     1               127           127
     2               102           229
     3                52           281
     4                18           299
     5                11           310
     6                 8           318
     7                 6           324
     8                 2           326
     9                 5           331
    10                 1           332
    11                 1           333
    12                 1           334
    13                 1           335
    14                 3           338
    15                 0           338
    16                 2           340



The graph of the index page location distribution is shown 
in Figure 5.4. Here it is evident that the number of entries
with but a single page location is significantly lower than
the overall trend line for the rest of the data. Further,
the bend in the data occasioned by this low value is sharper
than for any of the distributions in the subsample from the
Fondren Index Sample (see Appendix II). Interconnection of
the entries with structure words would clearly tend to break
apart entries presently agglomerated, thus reducing the number
of multiply occurring entries. Ignoring the low number of
singly occurring entries, the Zipf-Mandelbrot slope is 2.17,
well within the range of values found for the manually 
produced indexes. 
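
One way to estimate a slope of this kind from Table 5.4 is a least-squares line through the points (log k, log frequency), leaving out the singly occurring entries and the empty cell at 15 locations. The report does not state how its slope was computed, so the fitting method here is an assumption and the result need not reproduce the quoted 2.17 exactly.

```python
import math
from itertools import accumulate

# Table 5.4: frequency of entries having k = 1, 2, ..., 16 page locations.
freq = [127, 102, 52, 18, 11, 8, 6, 2, 5, 1, 1, 1, 1, 3, 0, 2]

cumulative = list(accumulate(freq))  # reproduces the cumulative column

# Log-log points, skipping k = 1 and any zero frequency (log undefined).
points = [
    (math.log(k), math.log(f))
    for k, f in enumerate(freq, start=1)
    if k >= 2 and f > 0
]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in points)
    / sum((x - mean_x) ** 2 for x, _ in points)
)

print(cumulative[-1])      # total number of entries
print(round(-slope, 2))    # magnitude of the fitted log-log slope
```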



The arithmetic mean of the number of page locations per 
entry is 3.19, nearly double the figure found for the subsample
of the Fondren Index Sample. However, this value is
distorted by the fact that consecutive page locations were
not agglomerated into single locations as is normally done
in manual indexing. When this factor is corrected, the
average number of page locations per entry becomes 2.14. 

As this value would be further reduced by inclusion of 
structure-word entries it would appear that this variation 
is not at all significant. 
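
The agglomeration of consecutive page locations can be illustrated with a short sketch. The function name is hypothetical; the sample entry is bibliographic record from Figure 5.2, whose nine page references collapse to six locations.

```python
def agglomerate(pages):
    """Collapse runs of consecutive page numbers into (first, last) locations."""
    runs = []
    for page in sorted(set(pages)):
        if runs and page == runs[-1][1] + 1:
            runs[-1][1] = page          # extend the current run
        else:
            runs.append([page, page])   # start a new run
    return [(first, last) for first, last in runs]


# "bibliographic record, 16, 18, 71, 73, 74, 81, 82, 83, 149"
locations = agglomerate([16, 18, 71, 73, 74, 81, 82, 83, 149])
print(locations)
# [(16, 16), (18, 18), (71, 71), (73, 74), (81, 83), (149, 149)]
print(len(locations))  # 6
```

Counting locations after this step, rather than raw page numbers, is what lowers the mean number of page locations per entry.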






Figure 5.4

Index Page Location Distribution
from the Index to
Computerized Library Catalogs

[Axis label: NUMBER OF INDEX ENTRIES]



In sum, aside from the failure to include structure-word 
entries or to agglomerate consecutive page locations, the 
statistical shape of the algorithmic index to Computerized
Library Catalogs appears sound. This is not to say that the
index is entirely comparable to a manually produced index. 
However, the first requirement in automating a process 
traditionally done manually is to meet the basic size 
constraints. Further developments in the technique will 
be illustrated in the next chapter to demonstrate that 
even closer approximations are possible. 



References 



1. Salton, Gerard, Automatic Information Organization
   and Retrieval, McGraw-Hill Book Co., New York, 1968.

2. Meetham, A. R. and R. A. Hudson, editors, Encyclopaedia
   of Linguistics, Information and Control, Pergamon
   Press, Oxford, 1969.

3. Dolby, J. L., V. Forsyth, and H. L. Resnikoff, Computerized
   Library Catalogs: Their Growth, Cost and Utility,
   The M.I.T. Press, Cambridge, 1969.

4. Dolby, J. L., "The Structure of Indexing: the Distribution
   of Structure-Word-Free Back-of-the-Book Entries",
   Proceedings of the American Society of Information
   Science (1968), 65-72.

5. Dolby, J. L. and W. E. Houchin, A Modular Suite of
   Programs for System ABC, R & D Consultants Co.,
   Los Altos, California, 1969.

6. Dolby, J. L., W. E. Houchin, H. L. Resnikoff, and Roger
   Stark, Non-Numeric Programming Language Studies:
   ALTEXT II, Final Report to the U. S. Air Force
   Office of Scientific Research, Contract #F44620-69-C-0094,
   R & D Consultants Co., Los Altos, California, 1970.









CHAPTER VI



AMALGAMATIVE ACCESS MECHANISMS 







INTRODUCTION 

The model proposed in Chapter 2 shows that the search
for access mechanisms must be conducted in compressive
powers of 30. It is principally the relative size
of an access mechanism that determines its utility.

That a compression of 30 must be effected in order to
move from one access level to the next, and that the
boundary between access levels corresponds to compression
of about a factor of 5, implies that there cannot be very
many possible access mechanisms to a particular level
of information storage. For instance, if the level to
be accessed is the book, then one must ask what natural
subsets of information there are in a book which
constitute about one-thirtieth of it. As has already been
pointed out, the average index to the average book
compresses the text by a factor of 31.8, so the book
index is a viable access mechanism. Studies of abstracts
of papers appearing in mathematical journals show that
the average complete abstract produces a compression of
about 30.6, so the journal paper abstract is also a
viable access mechanism. The book abstract should require
about 276.6/(2e)^2 = 9.3 pages; we do not have reliable
information about the average length of book reviews in
the professional literature, but this appears to us to
be a possible mean for scholarly reviews. On the other
hand, the capsule reviews of popular books that appear
in newspapers and other popular media, and in some
scholarly publications, are much shorter (perhaps the
equivalent of one or two pages) and lie on the boundary
between the levels of access mechanisms to books and access
mechanisms to access mechanisms to books, the latter
operating at the level of an enlarged table of contents
such as regularly appeared in previous centuries, and
still sometimes do, viz., the table of contents of Hans
Zinsser's Rats, Lice and History, from which we extract
the following:








  I.   In the nature of an explanation and an
       apology

  II.  Being a discussion of the relationship
       between science and art

  III. Leading up to the definition of bacteria
       and other parasites, and digressing briefly
       into the question of the origin of life

  IV.  On parasitism in general, and on the
       necessity of considering the changing nature of
       infectious diseases in the historical study
       of epidemics

  V.   Being a continuation of Chapter IV, but
       dealing more particularly with so-called new
       diseases and with some that have disappeared.

and so forth.

Another way of looking at the problem of discovering
possible methods for accessing books is this: the number
of characters in a book is about (2e)^8; reduction by a
factor of (2e)^2 leads to an information store about the
size of the index; further reduction by a factor of (2e)^2
to the next access level leads to a store of the size
of the table of contents. Another reduction by the same
factor produces (2e)^2, or approximately 30, characters,
which is nearly the size of a book title, as we have
determined in a preliminary fashion from a small uniform
subsample of the Fondren Sample. In fact, that estimate was
34.2 characters for monographs in the sample regardless of
language of title; had the subsample been restricted to
English language titles, the average length would have
been shorter. A final usable reduction is effected by
another division by (2e)^2, leading to a one character
access mechanism such as that provided by the Library
of Congress one letter class designation.
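The chain of reductions just described can be checked numerically. The
sketch below assumes the exponents 8, 6, 4, 2 and 0 of (2e) for the
five levels named in the paragraph; the labels and function names are
illustrative only and do not come from the report.

```python
import math

BASE = 2 * math.e          # about 5.44, the boundary factor between levels
LEVEL_FACTOR = BASE ** 2   # about 29.6, the "factor of 30" per access level

# Exponents of (2e) assumed from the paragraph above; labels are illustrative.
LEVELS = [
    ("book text", 8),
    ("back-of-the-book index", 6),
    ("table of contents", 4),
    ("title", 2),
    ("one letter class designation", 0),
]


def level_size(exponent):
    """Approximate size, in characters, of a store of size (2e)**exponent."""
    return BASE ** exponent


def book_abstract_pages(book_pages=276.6):
    """Pages needed for an abstract one access level below the book."""
    return book_pages / LEVEL_FACTOR


if __name__ == "__main__":
    for name, exponent in LEVELS:
        print(f"{name:30s} (2e)^{exponent} = {level_size(exponent):12,.1f}")
    print(f"book abstract: about {book_abstract_pages():.1f} pages")
```

Each step down the list divides the size by the same factor, which is
why only a handful of natural access mechanisms fit between the full
text and a single character.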

The important point is that every access level is filled.
Further study of possible new access mechanisms must
therefore be constrained to access mechanisms of the same
size as those that already exist. A natural question that
arises is whether it is desirable to have two access
mechanisms of the same size for a particular information
system. That such duplication does already exist is
easy to demonstrate:






1. The Author, Title, and Shelf orderings of a library
   card catalog are all essentially of the same size:
   roughly, one card image for each title in the
   collection. (The subject heading ordering is
   generally slightly larger, but still at the same
   access level as the others.)

2. The table of contents for a book is at the same
   access level as the catalog record.

3. Abstracts to journal articles appear in abstract
   journals, and index entries are frequently
   published at the end of the year in the journal
   itself. Both of these access mechanisms
   are first order devices,

and of course other examples involving titles,
descriptors, etc. can easily be found.

Thus the size of an access mechanism, though it is of 
first importance in describing the nature of the access 
it provides, is not sufficient to completely describe its 
characteristics. A second consideration that must be 
taken into account is easily illustrated by considering 
the sequences: 

Article, Abstract, Title 

and 

Book, Index, Table of Contents 



In the first sequence, each access device is acting
simply to compress the contents of the primary information
store. In the second sequence, each access mechanism
is itself a set of lower order access mechanisms collected
and sorted in a useful ordering. The abstract and the
title provide the user with the opportunity to determine
whether the document so described is likely to be relevant
to his need for information, in a general way. The index
and the table of contents provide the user with information
about the contents of the document together with the
location of particular pieces of information in the document.

The crucial question is that of agglomeration: an index
is an agglomeration of entries; a table of contents is
an agglomeration of entries; on the other hand both the
title and the abstract are entities themselves rather than
being agglomerations of other entities. It seems clear
from what has gone before that the minimal unit for
agglomeration is the first level unit (about 30
characters). Thus both the table of contents and the
index are agglomerations of first level units. However,
higher level agglomerations exist: the abstract journal
is an agglomeration of second level units, as is a
publication devoted to the republication of the tables of
contents of journals. Although we have not yet completed
our study of dictionaries and encyclopedias, it is clear
that each of these important access devices is an
agglomeration of higher level entities.

In this sense, an access mechanism can be described first
by its total size and secondly by the size of the primary
entries that it agglomerates. Thus an abstract is a zero
level agglomeration of second level entries; a table of
contents is a first level agglomeration of first level
entries; and an index (to a book) is a second level
agglomeration of first level entries.
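The two-part description above lends itself to a small data structure.
The sketch below is one possible reading, not a definition taken from
the report: it assumes that the total size level of a mechanism is the
sum of its agglomeration level and its entry level, and that each level
corresponds to a factor of (2e)^2 in characters. The class and field
names are hypothetical.

```python
import math
from dataclasses import dataclass

LEVEL_FACTOR = (2 * math.e) ** 2  # about 29.6 characters per access level


@dataclass(frozen=True)
class AccessMechanism:
    name: str
    agglomeration_level: int  # levels of collecting applied to the entries
    entry_level: int          # size level of each primary entry

    @property
    def total_level(self) -> int:
        # Assumption: total size level = agglomeration level + entry level.
        return self.agglomeration_level + self.entry_level

    @property
    def approx_characters(self) -> float:
        # Assumption: one level corresponds to a factor of (2e)^2 characters.
        return LEVEL_FACTOR ** self.total_level


# The three classifications stated in the text.
EXAMPLES = [
    AccessMechanism("abstract", agglomeration_level=0, entry_level=2),
    AccessMechanism("table of contents", agglomeration_level=1, entry_level=1),
    AccessMechanism("index to a book", agglomeration_level=2, entry_level=1),
]

if __name__ == "__main__":
    for m in EXAMPLES:
        print(f"{m.name:20s} level {m.total_level}: "
              f"about {m.approx_characters:,.0f} characters")
```

Under this reading the abstract and the table of contents share a total
size level while differing in entry level, which is the distinction the
paragraph draws.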

There are at least two other factors that must be taken
into account. First, a cumulative index to a series of
books on statistics obviously plays a different role than
the index to an encyclopedia even though both are third
level agglomerations of first level entries. The
difference here is that the encyclopedia is itself an
agglomeration of second or higher level access mechanisms,
while the books are primary information stores. The
difference in these two mechanisms would almost undoubtedly
show up in the slope (in the Mandelbrot sense discussed
earlier) of the index.

Finally, there are access mechanisms clearly dedicated
to "non-subject" access, e.g., author indexes, lists of
publications by publisher, place of publication, time
of publication, etc., which play a major role in library
access systems.

Consider a collection of titles, such as book titles, of
items which encompass a range of subject matter. The card
catalog title list is one ordering of such a collection.
If the collection is reordered to bring together all
titles which contain a given information bearing word,
then access to the collection is significantly increased.
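The reordering described above can be sketched as a simple keyword
grouping. This is an illustration only: the stop list, the sample
titles, and the function name are assumptions introduced here, and the
sketch does not reproduce the layout of any actual permuted title
index.

```python
from collections import defaultdict

# Hypothetical stop list of words treated as not information bearing.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def permuted_title_index(titles, stop_words=STOP_WORDS):
    """Map each information bearing word to the titles that contain it."""
    index = defaultdict(list)
    for title in titles:
        seen = set()
        for raw in title.split():
            word = raw.strip(".,;:()\"'").lower()
            if word and word not in stop_words and word not in seen:
                seen.add(word)  # list each title once per keyword
                index[word].append(title)
    # Sort keywords, and titles under each keyword, alphabetically.
    return {word: sorted(found) for word, found in sorted(index.items())}


# Hypothetical titles, used only to exercise the function.
TITLES = [
    "Order Statistics in Sampling",
    "The Theory of Sampling",
    "Optimal Allocation in Stratified Sampling",
]

if __name__ == "__main__":
    for word, found in permuted_title_index(TITLES).items():
        for title in found:
            print(f"{word:12s} {title}")
```

Every title is listed once under each of its information bearing
words, so a user scanning the entries for one word sees all the titles
that share it brought together.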

Studies of such access mechanisms have been underway for
some time, although none of them are generally available.
One of the most advanced title access mechanisms is that
prepared at Princeton University under the direction of
J. W. Tukey; it is a sophisticated permuted title index
consisting of more than 25,000 titles of journal papers
in the field of statistics. Since the average length of
a paper in mathematics is about 13.8 (normalized) pages,
a title represents a compression of about two access
levels, for the title as it appears in a permuted title
index carries information about the journal and author
as well, requiring about 130 characters. A sample page
from the Princeton permuted title index is shown in
Figure 6.1.

General considerations suggest that a permuted title list
of book titles for the Library of Congress letter class
subcollections of archival libraries would be a useful
tool, and one which would be readily obtainable as a
byproduct of the existence of a machinable catalog
data base.

Another type of amalgamative access mechanism, which
provides access to a collection of items belonging to
the same access level rather than to only one item, can
be constructed by performing the process normally used
to construct a standard access mechanism on the output
produced by another. For instance, we have studied the
utility of indexing abstracts to journal papers in the
statistical literature. The abstracts are normally
provided with the papers; they have been converted to
machinable form and an elementary version of the indexing
algorithm described in Chapter 5 was applied to them.
Appendix A5 exhibits the abstracts to 50 papers, the
associated abstract indexes produced by application of the
algorithm, and a cumulative list of the resulting index
terms with references to the articles in which the terms
appeared. We reproduce an abstract with its index as
Figure 6.2 and a page from the cumulative abstract index
as Figure 6.3. The abstract index was the first processed
in this series; it is perhaps not entirely typical of
the output from the algorithm. We have also processed
the same data using a variant of the algorithm which
ignores in its analysis stage the presence of the
preposition "of" and consequently will produce index entries
like "basic limit theorem of renewal theory", which appears
in Figure 6.2 only by way of its constituent phrases
"basic limit theorem" and "renewal theory".

An index to an abstract is a hybrid form of access
mechanism. The abstract already contains a large
proportion of significant phrases which are repeated in the
extractive output of the indexing algorithm. There is
therefore no hope that an index to an abstract can provide
a compression of a full factor of 30 that would be
necessary to descend from one access level to that
immediately below it. In fact it appears that indexing
abstracts leads to a compression of about 15; since this is
significantly greater than (2e), such a procedure does






Figure 6.1

Permuted Title Index Page (Left Hand Side)



Figure 6.1

Permuted Title Index Page (Right Hand Side)



Figure 6.2

Abstract and Abstract Index



Figure 6.3 

Cumulative Index to 50 Abstracts (one page) 



NONCENTRAL DISTRIBUTIONS                            [illegible]
NONCENTRAL MULTIVARIATE BETA DISTRIBUTION           43
NONNULL DISTRIBUTIONS                               [illegible]
NONPARAMETRIC ALTERNATIVE                           42
NONPARAMETRIC ALTERNATIVES                          [illegible]
NONPARAMETRIC TESTS                                 10
NORMAL ALTERNATIVES                                 [illegible]
NORMAL APPROXIMATION                                [illegible]
NORMAL DISTRIBUTION                                 32
NORMAL DISTRIBUTION                                 [illegible]
NORMAL DISTRIBUTIONS                                29
NORMAL DISTRIBUTIONS                                [illegible]
NORMAL POPULATIONS                                  29
NORMAL THEORY LIKELIHOOD RATIO STATISTIC            27
NORMAL THEORY LIKELIHOOD RATIO TEST STATISTIC       [illegible]
NULL DISTRIBUTION                                   29
NULL DISTRIBUTIONS                                  25
NULL HYPOTHESIS                                     [illegible]
NULL HYPOTHESIS                                     29
NULL RECURRENT CHAINS                               1
NUMBER SUCCESSES                                    23
NUMERICAL COMPARISONS                               41
NUMERICAL EXAMPLES                                  29
ONE-DIMENSIONAL EMPIRICAL PROCESS CONVERGE          [illegible]
ONTO ITSELF                                         47
OPERATIONAL CHARACTERISTICS                         17
OPTIMAL ALLOCATION                                  2
OPTIMAL ALLOCATION PROBLEMS                         2
OPTIMAL STRATIFIED SAMPLING                         2
OPTIMUM ALLOCATION                                  [illegible]
OPTIMUM BEST LINEAR                                 33
OPTIMUM BLUE'S                                      33
OPTIMUM NON-PARAMETRIC STATISTICS                   30
ORDER ABSOLUTE CENTRAL MOMENT                       26
ORDER STATISTICS                                    13
ORDER STATISTICS                                    [illegible]
ORDER STATISTICS                                    33
ORDER STATISTICS                                    33
OVERALL AVERAGE                                     40
P-DIMENSIONAL SPACE                                 11
P-DIMENSIONAL VARIATE                               2
P-VARIATE NORMAL POPULATIONS                        [illegible]
PAPER GENERALIZES                                   36
PAPER TREATS *-COMPARIS                             17
PARAMETER SET                                       17
PARAMETER SETS                                      17
PARAMETRIC CLASSES                                  5
PAST OBSERVATIONS AVAILABLE                         39
PHASE DISTRIBUTION                                  3
PHASE DISTRIBUTION                                  4
PHASE SERVICE TIME DISTRIBUTION                     4
POPULATION MEAN                                     16
POPULATION MEAN                                     40
POPULATION SIZE                                     23







realize a compressive gain that may be useful for accessing
the abstracts. It will certainly be useful for accessing
the original documents when it is applied to a collection
of abstracts and the resulting indexes are accumulated.

The page extracted from the middle of the cumulative abstract
index reproduced as Figure 6.3 shows that one paper in
the sample of 50 referred, via its abstract, to the
"NONCENTRAL MULTIVARIATE BETA DISTRIBUTION", and, since the
abstract transmitted this phrase, the paper undoubtedly
contains something of interest about this topic. Similarly
note that eight papers referred to the "NORMAL"
distribution in some form. The presence of spurious terms like
"ONTO ITSELF" and "OPTIMUM BLUE'S" is no more than a minor
annoyance in use of the index, and is of course due to
inadequacies in the indexing algorithm's "stop list",
which should certainly contain the word "ITSELF". There
are other more subtle problems whose genesis is the
indexing algorithm, but they are not so obtrusive as to
make the use of the list burdensome. For instance, the
phrase "OPTIMUM BLUE'S" occurs in the abstract, where it
is defined to denote "OPTIMUM BEST LINEAR UNBIASED
ESTIMATE"; this phrase certainly belongs in the index,
but it is not clear that a user of the amalgamated index
would recognize the technical meaning of "BLUE" until
it had become a standard term of the field.
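The accumulation of per-abstract indexes into a cumulative index, with
a stop list to suppress terms such as "ONTO ITSELF", can be sketched as
follows. This is a simplified illustration and not the indexing
algorithm of Chapter 5: it assumes that any candidate phrase containing
a stop-list word is dropped outright, it merges repeated terms into one
line (Figure 6.3 lists them on separate lines), and the stop list shown
contains only the one word the text names. The sample data are a few
entries legible in Figure 6.3.

```python
from collections import defaultdict

# Assumed stop list; the text says only that "ITSELF" ought to be on it.
STOP_LIST = {"ITSELF"}


def cumulate(abstract_indexes, stop_list=STOP_LIST):
    """Merge per-abstract index terms into term -> sorted abstract numbers."""
    cumulative = defaultdict(set)
    for number, terms in abstract_indexes.items():
        for term in terms:
            phrase = term.upper()
            if any(word in stop_list for word in phrase.split()):
                continue  # simplification: drop phrases with a stop word
            cumulative[phrase].add(number)
    return {term: sorted(nums) for term, nums in sorted(cumulative.items())}


# A few entries legible in Figure 6.3 (abstract number -> index terms).
SAMPLE = {
    47: ["ONTO ITSELF"],
    2: ["OPTIMAL ALLOCATION", "P-DIMENSIONAL VARIATE"],
    16: ["POPULATION MEAN"],
    40: ["POPULATION MEAN", "OVERALL AVERAGE"],
}

if __name__ == "__main__":
    for term, numbers in cumulate(SAMPLE).items():
        print(f"{term:30s} {', '.join(map(str, numbers))}")
```

A term that recurs across abstracts, such as "POPULATION MEAN", ends up
with several references, which is what makes the cumulative list useful
for reaching the original papers.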

Indexing abstracts is of potential value in gaining
access to the large numbers of journal papers which
annually appear in the literature; coupled with permuted
title access mechanisms, the abstract index should
provide a rapid and reliable means of surveying the key
content areas of papers without the time-consuming
process of reading abstracts, which often limits one to
a relatively narrow and current range of documents.

When compared to the earlier manual implementation of
the algorithm on Computerized Library Catalogs, the
machine implementation of the algorithm differs in
several ways, aside from the obvious fact that the machine
is entirely consistent in its application where manual
procedures cannot be. The raw data for the machine test
on the statistical abstracts was keypunched in all upper
case, as a matter of convenience. Hence, the rule to
keep capitalized one-word entries was inoperative in
this run. Further, no attempt has been made to include
see or see also type references in the machine
implementation. On the other hand, the machine implementation
includes logic to allow structure-word entries where the
manual implementation did not.

These differences are reflected in the statistics
describing the entry length and page location distributions.
Table 6.1 provides the entry length (in number of words)
distribution for the machine index to the statistical
abstracts.



                    Table 6.1

            Entry Length Distribution
        Algorithmic Index to Statistical
                    Abstracts

    Number of                       Cumulative
      Words        Frequency        Frequency

        1               0                0
        2             315              315
        3             233              548
        4             125              673
        5              54              727
        6              19              746
        7               6              752
        8               2              754



Comparing this distribution to the comparable
distribution for the manually implemented algorithmic index
to Computerized Library Catalogs (Table 5.3), one sees
that the proportion of one-word entries has been
reduced to zero (because there is no logic available to
permit one-word entries) and that the overall average
entry length has been increased from 2.08 words per
entry to 3.01 words per entry. The main factor in
this increase is the introduction of structure-word
entries, although the absence of one-word entries has
a small effect on average entry length as well.
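The average entry length quoted above can be recomputed directly from
the frequencies in Table 6.1. The short sketch below does so and also
rebuilds the cumulative column; the variable and function names are
illustrative.

```python
# Frequencies from Table 6.1: entry length in words -> number of entries.
ENTRY_LENGTH_FREQ = {1: 0, 2: 315, 3: 233, 4: 125, 5: 54, 6: 19, 7: 6, 8: 2}


def cumulative(freq):
    """Running totals of the frequencies, keyed by entry length."""
    total, out = 0, {}
    for length in sorted(freq):
        total += freq[length]
        out[length] = total
    return out


def mean_length(freq):
    """Frequency-weighted mean entry length in words."""
    entries = sum(freq.values())
    return sum(length * count for length, count in freq.items()) / entries


if __name__ == "__main__":
    print(cumulative(ENTRY_LENGTH_FREQ))
    print(f"mean entry length: {mean_length(ENTRY_LENGTH_FREQ):.2f} words")
```

The weighted sum is 2271 words over 754 entries, which agrees with the
figure of 3.01 words per entry given in the text.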

The entry length distribution is plotted in Figure 6.4.
Despite the absence of one-word entries, the points are
well fit by a straight line, confirming that the
distribution is closely approximated by a lognormal
distribution.
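One way to reproduce this kind of straight-line check numerically is sketched below. It is an assumption that the figure plots cumulative percentage on a normal probability scale against the logarithm of entry length; the report does not spell out the construction, and the correlation coefficient used here is only a convenient stand-in for judging the fit by eye.

```python
import math
from statistics import NormalDist

# Cumulative frequencies from Table 6.1: {words per entry: cumulative count}.
CUMULATIVE = {2: 315, 3: 548, 4: 673, 5: 727, 6: 746, 7: 752, 8: 754}


def probability_plot_points(cumulative):
    """Return (log length, normal quantile) pairs for a lognormal plot."""
    total = max(cumulative.values())
    points = []
    for words, count in sorted(cumulative.items()):
        p = count / total
        if 0.0 < p < 1.0:  # the normal quantile is undefined at 0 and 1
            points.append((math.log(words), NormalDist().inv_cdf(p)))
    return points


def correlation(points):
    """Return the Pearson correlation of the plotted points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    pts = probability_plot_points(CUMULATIVE)
    print(len(pts), round(correlation(pts), 3))
```

A correlation close to 1 indicates that the points lie near a straight line on such a plot, which is the behaviour expected of a lognormal distribution.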

It will be recalled that in the previous study of the page
location distribution for the manually implemented
version of the algorithm on Computerized Library Catalogs
there was a significant bend in the Zipf-Mandelbrot
straight line, due to either a reduced number of singly
occurring entries or an excessive number of multiply
occurring entries. For the machine version of the
algorithm the page location distribution (or, more accurately,
the abstract number location distribution) does not show
this deviation (see Figure 6.5). The distribution is
given in Table 6.2.



[Figure 6.4. Entry length distribution, algorithmic index to
statistical abstracts. Axes: percentage; number of words per
entry.]



Figure 6.5

Abstract Number Location Distribution
Algorithmic Index to Statistical Abstracts

[Plot on logarithmic scales. Axes: Number of Abstract
References; Number of Index Entries.]



Table 6.2

Abstract Number Location Distribution
Algorithmic Index to Statistical Abstracts

Number of
Abstract Locations                   Cumulative
per Entry             Frequency      Frequency

1                        703            703
2                         31            734
3                          6            740
4                          9            749
5                          5            754



Compared with the corresponding data for the manual
implementation, it is clear that not only has the
difficulty of an insufficient proportion of singly
occurring entries been corrected by the insertion
of structure-word logic, but the slope of the
line has been significantly increased, from 2.17
to 4.49. This increased slope can of course be
attributed in part to the nature of the material
covered in the two cases and, perhaps in greater
proportion, to the structure of the material (i.e.,
fifty abstracts vs. a single text). Nonetheless,
the increase in slope does tend to confirm the
expectation that use of structure-word entries is
desirable to increase slope.
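The figures in Table 6.2 allow a rough check on both claims. The sketch below computes the share of singly occurring entries and a log-log slope between the first two points of the distribution. The report does not state how its slope of 4.49 was obtained, so the two-point slope is an assumption made for illustration, and it comes out close to, but not identical with, the quoted value.

```python
import math

# Table 6.2: {abstract locations per entry: number of index entries}.
LOCATIONS = {1: 703, 2: 31, 3: 6, 4: 9, 5: 5}


def share_single(freqs):
    """Return the proportion of entries that occur in exactly one abstract."""
    return freqs[1] / sum(freqs.values())


def two_point_slope(freqs, k1=1, k2=2):
    """Return the (positive) log-log slope between locations k1 and k2.

    This is an illustrative estimate only; the report's own fitting
    procedure is not described.
    """
    rise = math.log(freqs[k1]) - math.log(freqs[k2])
    run = math.log(k2) - math.log(k1)
    return rise / run


if __name__ == "__main__":
    print(round(share_single(LOCATIONS), 3))     # proportion of single entries
    print(round(two_point_slope(LOCATIONS), 2))  # slope between k=1 and k=2
```

About 93 percent of the 754 entries occur in a single abstract, which is the dominance of singly occurring entries that the text describes.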

Potentially more useful than the amalgamation of indexes
to abstracts of papers or books is the amalgamation of
indexes to the primary texts themselves. We have
undertaken an extensive project designed to provide a realistic
test of the utility of amalgamations of book indexes
as well as an indication of the problems that would be
encountered in the preparation of such access mechanisms.

The indexes contained in 80 books on statistics have been
committed to machinable form. Approximately 30,000
index entries (not all of which are distinct) are
represented, which is nearly 400 entries per book. This is
significantly less than the average of 838 index entries per
book obtained from the Fondren Index Sample, but, as is clear
from Table 4.4, it is well within the deviations typically
obtained by restriction of a sample to small and
specially defined subsets. We have not attempted to
determine the average number of pages per book in this
statistics sample; it may well be that the average number
of index entries per page is in closer agreement with
the figure obtained for the Fondren Index Sample.

The Statistics Index Sample is currently in the early 
stages of amalgamation. In this report we can only 
exhibit a combined alphabetically ordered list which 
has not been formatted (to reproduce the usual format 
of a book index) and which exhibits the consequences 
of some program "bugs" not yet corrected which result 
in the replication of input records at various places 
throughout the amalgamated index. In spite of these 
difficulties, the amalgamated list is already a valuable 
access tool. 

Table 6.3 lists the books that constitute the Statistics
Index Sample. The code in the leftmost column is the
abbreviation for the book used in the amalgamated index.
These books were chosen by a professional statistician
as representative of the more important information in the
statistics field that is available in monograph form.

The choice of 80 books rather than a larger number is 
purely conventional; continuation of this project will 
increase the data base and permit us to determine how the 
yield of new index terms varies with increasing size 
of the sample. 

Following the lead of the analysis of the structure of
the index to a single book given in Chapters 3 and 4,
we see that the rank-frequency distribution (Figure 6.6)
is just another form of the index reference distribution
discussed in those chapters; in the form shown here,
the abstract entries appear at the top left part of the
graph, and the horizontal portions of the graph correspond
to those entries which refer to the same number of text
locations. Consequently, the abstract entries for the
Statistics Sample certainly include those that have
ranks less than 30, and may include several more, but
not any with rank greater than 50.
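The relation between reference counts, ranks and the horizontal portions of a rank-frequency graph can be sketched as follows. The counts used are the sixteen abstract entries listed in Appendix I for sample book 75 (Coolidge), chosen only because they are printed in this report; they are not the data behind Figure 6.6, and the function names are illustrative.

```python
from itertools import groupby

# Reference counts of the abstract entries listed for sample book 75
# (Coolidge, Ten Years of War and Peace) in Appendix I.
COUNTS = [31, 20, 20, 19, 18, 16, 15, 13, 13, 11, 11, 11, 10, 10, 10, 10]


def rank_frequency(counts):
    """Return (rank, count) pairs, rank 1 being the most referenced entry."""
    ordered = sorted(counts, reverse=True)
    return list(enumerate(ordered, start=1))


def plateaus(counts):
    """Return (count, first rank, last rank) for each run of equal counts.

    Each run corresponds to one horizontal portion of the graph.
    """
    ordered = sorted(counts, reverse=True)
    runs, rank = [], 1
    for value, group in groupby(ordered):
        size = len(list(group))
        runs.append((value, rank, rank + size - 1))
        rank += size
    return runs


if __name__ == "__main__":
    for value, first, last in plateaus(COUNTS):
        print(f"{value:3d} references: ranks {first}-{last}")
```

Entries sharing a count occupy consecutive ranks, so the plateaus lengthen as the count falls; the abstract entries are the short, high-count runs at the top left.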

Table 6.4 lists the 30 index terms that refer to the
greatest number of pages; personal names have been placed
in the right hand column; otherwise the order of appearance
in the amalgamated index list is the order shown in the
table.* This list is a useful pedagogical tool, providing



* The frequencies given here are very tentative, as no 
attempt has yet been made to agglomerate proper names 
appearing in variant form. 



Table 6.3

Bibliographical Description of the Statistics Index Sample

A   Elementary Decision Theory - Chernoff
B   Nonparametric Methods in Statistics - Fraser
C   Statistical Methods for Chemists - W. J. Youden
D   Analysis of Straight-line Data - Acton
E   Testing Statistical Hypotheses - E. L. Lehmann
F   Introduction to Mathematical Statistics - Paul G. Hoel
G   The Design and Analysis of Experiments - O. Kempthorne
H   An Introduction to Multivariate Statistical Analysis - T. W. Anderson
I   Statistics — An Introduction - D. A. S. Fraser
J   Linear Computations - Paul S. Dwyer
K   Modern Probability Theory and Its Applications - Parzen
L   Planning of Experiments - Feller
M   Theory of Games and Statistical Decisions - Blackwell and Girshick
N   An Introduction to Probability Theory and Its Applications, Vol. 1 - Feller
O   Elementary Statistics - Paul G. Hoel
P   The Elements of Probability - Cramer
Q   Statistical Decision Theory - Weiss
R   Introduction to Probability and Random Variables - Wadsworth and Bryan
S   Introduction to the Theory of Statistics - Mood and Graybill
T   Elements of Probability and Statistics - Wolf
U   An Introduction to Linear Statistical Models, Vol. 1 - Graybill
V   Elements of the Theory of Markov Processes and Their Applications - Bharucha-Reid
W   Geometrical Probability - Kendall and Moran
X   Fundamentals of Statistical Reasoning - Quenouille
Y   Characteristic Functions - Lukas
Z   An Introduction to Probability Theory and Its Applications, Vol. 2 - Feller
AB  Elements of Mathematical Statistics - Alexander
AC  Statistical Theory and Methodology in Science and Engineering - Brownlee
AD  Statistics and Experimental Design, Vol. 1 - Johnson and Leone
AE  Mathematical Statistics - Wilkes
AF  Experimental Designs - Cochran and Cox
AI  A Course in Probability Theory - Kai Lai Chung
AJ  Essentials of Probability - Arthur Yaspan
AK  The Design of Experiments - Fisher
AL  Computational Handbook of Statistics - Bruning and Kintz
AM  Design and Analysis of Experiments - Quenouille
AN  Handbook of Statistical Tables - Owen
AO  The Elements of Probability - Berman
AP  Design and Analysis of Industrial Experiments - Davies
AQ  Statistical Theory - Lindgren
AR  Introduction to Statistics - Carlborg
AS  Probability and Statistics - Adler and Rossler
AT  Measuring Uncertainty — An Elementary Introduction to Bayesian Statistics
AU  A Brief Introduction to Probability Theory
AV  Statistical Design and Analysis of Experiments for Development Research - Villars
AW  Statistics in Research - Ostle
AX  Schaum's Outline Series Theory and Problems of Probability - Lipschutz
AY  Elementary Mathematical Programming - Metzger
AZ  Statistical Inference for Markov Processes - Billingsley
BC  Introduction to Probability — A Programmed Unit in Modern Mathematics
BD  Statistical Analysis of Stationary Time Series - Grenander and Rosenblatt
BE  Statistical Methods in Experimentation — An Introduction - Lacey
BF  Stochastic Processes — Basic Theory and Its Application - Prabhu
BG  Probability and Frequency - Plummer
BH  Statistical Methods for Research Workers
BI  Probability, an Intermediate Text-Book - Rixley
BJ  Regression Analysis - Williams
BK  Statistical Processes and Reliability Engineering - Chorafas
BL  Introduction to Probability and Mathematical Statistics - Birnbaum
BM  Elementary Mathematical Statistics - Baton
BN  Introduction to Biostatistics - Bancroft
BO  Sampling Techniques - Cochran
BP  A History of the Mathematical Theory of Probability - Todhunter
BQ  Statistical Methods in Biology - Bailey
BR  Statistical Theory - The Relationship of Probability Credibility and Error
BS  An Introduction to Multivariate Statistical Analysis - Anderson
BT  Probability and Experimental Errors in Science - Parratt
BU  Contributions to Order Statistics - Sarhan and Greenberg
BV  Introduction to Statistical Method - Ehrenfeld and Littauer
BW  Theory of Probability - Jeffreys
BX  Statistical Adjustment of Data - Deming
BY  Statistical Analysis in Chemistry and the Chemical Industry - Bennett and Franklin
BZ  Probability Random Variables and Stochastic Processes - Papoulis
CD  Elements of Queueing Theory with Applications - Saaty
CE  Stochastic Processes - Doob
CF  Sample Survey Methods and Theory Vol 1 Methods and Applications - Hansen, Hurwitz
CG  Advanced Statistical Methods in Biometric Research - C. Radhakrishna Rao
CH  Introduction to the Mathematics of Statistics - Robert W. Burgess




Figure 6.6

Rank - Frequency of Reference Distribution
Statistical Index Sample



Table 6.4

ABSTRACT ENTRIES FOR THE
AMALGAMATED STATISTICS INDEX SAMPLE

Normal distribution                 Fisher, R. A.
Binomial distribution               Student
Poisson distribution                Pearson, E. S.
Degrees of freedom                  Kendall, M. G.
Conditional probability             Bartlett, M. S.
Standard deviation                  Cramer, H.
Analysis of variance                Neyman, J.
Distribution
Chi-square distribution
Central limit theorem
Least squares
Variance
Correlation coefficient
Median
Cauchy distribution
Covariance
Independence
Random variable
Exponential distribution
Gamma distribution
Moments
Bivariate normal distribution
Multinomial distribution




as it does an immediate and objective overview of the 
important subjects in statistics as well as the important 
contributors. It plays the same role relative to that 
portion of the field of statistics represented in the 
monograph literature that the abstract entries for the 
books described in Chapter 5 played; and it increases 
the degree of information compression as well. 

Figure 6.7 shows one page from the uncorrected form of the
amalgamated Statistics Index Sample described above.

This page has been selected to include the entry "log
normal" and those related to it. Observe that six books
(coded P, S, AD, BL, BU, CD) contain references to the
log normal distribution; since this represents only 7.5%
of the books in the Statistics Index Sample, the
unsophisticated inquirer will realize a very significant saving
in search time, with a reasonable degree of assurance that
most of the significant references will either be covered
directly within these six books, or more comprehensive
treatments will be noted in their bibliographies.
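The kind of lookup described here can be sketched as a merge of per-book indexes into one alphabetical list, followed by a search for the books that mention a term. This is a minimal illustration in Python and not the program used for the project; the book codes B1 to B3, the entries and the page numbers are all hypothetical.

```python
def amalgamate(indexes):
    """Merge per-book indexes into one alphabetical list.

    `indexes` maps a book code to {index term: [page numbers]}.
    Each output record is (TERM IN UPPER CASE, book code, pages).
    """
    merged = []
    for code, index in indexes.items():
        for term, pages in index.items():
            merged.append((term.upper(), code, pages))
    merged.sort(key=lambda record: (record[0], record[1]))
    return merged


def books_for(merged, prefix):
    """Return the sorted codes of books with an entry starting with prefix."""
    prefix = prefix.upper()
    return sorted({code for term, code, _ in merged if term.startswith(prefix)})


# Hypothetical toy input: codes, terms and pages are invented for illustration.
INDEXES = {
    "B1": {"Log normal distribution": [118], "Location parameter": [62]},
    "B2": {"Logarithm table": [204, 205]},
    "B3": {"Log normal": [95], "Local maximin test": [339]},
}

if __name__ == "__main__":
    merged = amalgamate(INDEXES)
    for term, code, pages in merged:
        print(code, term, pages)
    hits = books_for(merged, "log normal")
    print(hits, f"{len(hits) / len(INDEXES):.1%} of books")
```

With the real sample the same count gives 6 of 80 books, or 7.5%, for the log normal entries; matching on a prefix is a simplification, since variant spellings such as "log-normal" would need to be normalized first.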




Figure 6.7

Sample Page from the Amalgamated Statistics Index

[Reproduced page of the uncorrected computer listing. Each
line gives a book code from Table 6.3, an index entry in
upper case, and its page references. The page runs
alphabetically from LOCAL EXPLORATION to LOGARITHMIC
PROBABILITY PAPER and includes the LOG NORMAL and LOG-NORMAL
DISTRIBUTION entries.]



APPENDIX I

ABSTRACT INDEX ENTRIES:
A UNIFORM SAMPLE FROM THE
FONDREN INDEX SAMPLE



25
Marston, William Moulton            BF181.M3 1931
Integrative Psychology
Sum= 1432 / 29.54 = 48

23 Marston, W.M.
17 Freud, Sigmund
15 Watson, J.B.
13 Cannon, W.B.
11 Adler, Alfred
10 Desire
10 Jung, Carl
10 Libido
10 Woodsworth, R.S.
9 Compliance
9 Passion
8 Allport, F.H.
8 Behaviourism
8 James, Wm.
8 MacDougall, Wm.
7 Herrick, C.J.
7 Satisfaction
7 Sherrington, C.S.
6 Captivation
6 Dominance
6 Psychoanalysts
5 Carlson, A.J.
5 Erotic drive
5 Inducement
5 Passion response
5 Visual discrimination
4 Angell, J.R.
4 Archtypes
4 Cell body
4 Compliance, motives
4 Eng, H.
4 Hering, E.
4 James-Lange, Theory of Motion
4 Law of integrative sequence
4 Origination response
4 Passion motives
4 Submission
4 Trolaut, L.T.
4 Unit responses, compound
4 Washburn, M.F.
4 Yerkes, R.M.



ubstances, hypothetical 






50
McLean, Archibald                   BV2532.M3 1921
The History of the Foreign Missionary Society
Sum= 498 / 29.54 = 16.86

7 Fallen, The
6 Moore, W.T.
4 Nurses being trained
3 Bilaspur
3 Johnson, Miss Kate V.
3 Loos, C.L.
3 Moore, W.T., Quoted
3 Rijnhardt, Dr. Susie C.



75
Coolidge, Archibald Cary            D443.C6 1927
Ten Years of War and Peace
Sum= 415 / 29.54 = 14

31 Great Britain mentioned
20 France, mentioned
20 Poland
19 League of Nations, mentioned
18 Versailles, Treaty of
16 Wilson, Woodrow
15 Hungary
13 Algeria
13 Hughes, Charles E., Secretary of State
11 Germany, mentioned
11 Harding, Warren G.
11 Morocco
10 China
10 France, estrangement between and Great Britain
10 Japan, mentioned
10 Rumania




100
Sackville-West, Victoria Mary       DA690.K7 1922
Knole and the Sackvilles
Sum= 307 / 29.54 = 10

7 Sackville, Lady Margaret (afterwards Countess of Thanet),
  mentioned in Lady Anne Clifford's diary
4 Pepys, Samuel, quoted
4 Walpole, Horace, quoted on Knole
3 Devonshire, Duchess of, his (i.e. 3rd Duke of Dorset) letter to
  her
3 Dryden, John, his debt to 6th Earl of Dorset
3 Gorboduc
3 Macaulay, quoted
3 Sackville, Charles, 6th Earl of Dorset, songs quoted
3 Sackville, Lord George, quoted
3 Wraxall, Sir Nathaniel, quoted




125
Sherrard, Philip                    DF521.S4 1966
Byzantium
Sum= 1643 / 29.54^2 = 1.88

10 Churches: in Constantinople
10 Frescoes



150
Institute of Culture                DS423.C85 v4 1953-58
The Cultural Heritage of India
Sum= 4906 / 29.54^2 = 5.6

51 Krsna (Sri)
46 Siva
41 "Bhagavad-Gita"
37 Visnu
33 Brahman
32 Guru(s)




175
Saveth                              E178.6.S3 1965
Understanding the American Past
Sum= 1958 / 29.54^2 = 2.24

1 ^ A and Mary 
28 Beard, Charles 

20 Jefferson, Thomas 

17 Turner, Frederick Jackson









200
Link, Arthur S.                     E741.L55 1963
American Epoch: A History of the United States Since the 1890's
Sum= 7016 / 29.54^2 = 8

14 Prices: agricultural
12 Foreign relations: Anglo-American
11 Federal income tax: individual
11 Tax: individual income
10 Farmers, income of
10 Legislation: agricultural
9 Agriculture, legislation for
9 Railroads: rates of



225
Bolton, Herbert Eugene              F864.B68 v4 1930
Anza's California Expedition
Sum= 648 / 29.54 = 22

132 Mass
111 Anza, Juan Bautista
60 Garces, Fray Francisco
46 Monterrey
44 San Gabriel Mission
27 San Diego mentioned
26 Colorado River
26 Ribera (Rivera) Fernando de
26 Sierra Madre de California
23 Apaches
21 Palma, Salvador
21 Sierra Nevada
20 San Miguel de Horcasitas
19 Eixarch (Eyxarch) Fray Thomas
18 San Francisco, harbor and settlement
17 Mexico
16 Spaniards
15 Christian Indians
15 Fages, Pedro
15 Gila River
14 Pablo (Captain Feo) Yuma Chief
13 Crespi, Fray Juan
13 Rio de San Francisco (San Joachin)






250
De Garmo, Ernest Paul               HB199.W595 1960
Engineering Economy
Sum= 650 / 29.54 = 22

5 Terborgh, George
4 Break even charts, examples of
3 Balance sheet, example of
3 Deferred-investment studies, examples of
3 Minimum cost point
3 Personnel factors, lighting
3 Rate of return, determination of
3 Selection, of design
3 Survivor curves, examples of
2 Accidents, effect of lighting on
2 Annuities whose present value is
2 Borrowed capital, cost of
2 Bureau of Internal Revenue relation to depreciation
2 Capital gains, and losses
2 Capital gains and losses, carry-over of
2 Capitalized cost, example of application
2 Costs, accuracy of estimates of
2 Costs, labor
2 Depreciation, sum-of-the-years'-digits
2 Hoover Dam
2 Income and expense statements, example of
2 Income taxes in public utility studies
2 Increment costs
2 Labor, turnover of
2 Life, economic
2 Life, useful
2 MAPI replacement formulas, forms for use in
2 Material, selection of
2 Multiple-purpose works, evaluation of benefits from
2 Overhead expense bases for distribution of
2 Plant location, economy studies of
2 Power factor, effect on utility rates
2 Rate schedules, block demand
2 Rautenstrauch, Walter
2 Risk, factors affecting
2 Selection of methods or processes
2 Self liquidating projects, relation of taxes to
2 Self liquidating projects, repayment of capital in
2 Wage payment, piece work



275
Chorafas, Dimitris N.               HD20.C554 1958
Operations Research for Industrial Management
Sum= 141 / 29.54 = 5

17 Charts on simulated business results
11 Computers usage
11 Simulation
10 Allocation
9 Managerial decisions







300
Smart, William                      HF2046.S62 1904
The Return to Protection
Sum= 201 / 29.54 = 7

20 Chamberlain, J.
20 Germany
19 Board of Trade
14 France
10 Giffen, Sir Robert
10 Shipping
9 America
9 America and protection
9 Canada



325
Cole, George Howard Douglas         HM66.C7 1920
Social Theory
Sum= 382 / 29.54 = 13

22 Trade Unions
15 "State, The"
14 Associations
12 Churches
12 Functional Equity, Court of
organisation
10 Law
10 Rousseau
9 Sovereignty
8 Function in relation to individual, perversion of
8 Marxism
8 Will, as a basis of Society
7 Middle Ages
7 Parliament




350
Utechin, Sergei                     JA84.R9 U8 1964
Russian Political Thought
Sum= 409 / 29.54 = 14

50 Economy
45 Classes, social
43 Law
39 Germany
34 Individualism
34 Monarchy
33 Emigration
33 France
33 Peasantry
32 Intelligentsia
31 Education
30 Property
30 Terror
29 Christianity
29 Culture
29 Equality
29 Moscow
29 Nationalism
29 Nobility
29 St. Petersburg




375
Cooper, Lane                        LB875.C7 1922
Two Views of Education
Sum= 775 / 29.54 = 26

43 America 
42 Milton 
39 Plato 
34 Shakespeare 
27 Aristotle 
26 Homer 

26 Teacher (of English, etc.) 

25 Greek, Study of 

24 Middle Ages 

22 Horace 
22 Wordsworth 

21 Bible 

20 Latin, Study of 
18 Cicero 

17 Rewards of the Teacher 

16 Odyssey 
16 Rome 
16 Virgil 

15 Chaucer 
15 Discipline 
15 England 
15 Socrates 

14 Dante 

13 Democracy 
13 Greece 
13 Rousseau 










400
Morgan, Alexander                   LC191.M6 1916
Education and Social Progress
Sum= 254 / 29.54 = 9

8 Children, diseases of
8 Education, practical
8 Inter-Departmental committee
7 Kindergartens
7 Practical education
7 Vocational education
6 Commission, Royal, on Poor Laws
6 Continuation education
6 Edinburgh, continuation schools
6 Education, continuation
6 Education, vocational
6 Education and health
6 Plato
6 Scotch Education Department
6 Slums, children in














425
Tomkins, Calvin                     ND553.D774 T6 1965
The Bride and the Bachelors
Sum= 414 / 29.54 = 14

22 "Bride Stripped Bare By Her Bachelors, Even, The"
   (Duchamp)
15 Tudor, David
14 Cunningham, Merce
13 Rauschenberg, Robert
12 Klüver, Billy
11 Johns, Jasper
10 Duchamp, Marcel
8 Feldman, Morton
8 Thomson, Virgil
8 Tinguely, Jean
7 Arensberg, Walter C.
7 Breton, Andre
7 Cage, John
7 Cowell, Henry
7 Dreier, Katherine
7 Kashevaroff, Xenia Andreevna
7 "Nu Descendant un Escalier" (Duchamp)
7 "Nude Descending a Staircase" (Duchamp)
7 Schönberg, Arnold



450
Bobbe, Dorothie (De Bear)           PN2598.K4 B6 1931
Fanny Kemble
Sum= 384 / 29.54 = 13

42 Butler, Pierce
40 St. Leger, Harriet
36 Kemble, Charles
22 Butler, Sarah
19 Butler, Fanny
19 Siddons, Sarah
18 Covent Garden Theatre
18 Lenox, Mass.
17 Kemble, Adelaide
17 Kemble, Mrs Charles
17 Sartoris, Mrs Edward
17 Slavery
15 Kemble, John Mitchell




475
Hoppe, Harry Reno                   PR2831.H6 1948
The Bad Quarto of Romeo and Juliet
Sum= 529 / 29.54 = 18

38 Greg, Walter W.
22 Chambers, (Sir) E.K.
18 Hart, Alfred
18 McKerrow, R.B.
14 Burby, Cuthbert
13 Boswell, Eleanor
13 Greene, Robert "Orlando Furioso"
12 Arber, Edward
12 Recollections
12 Shakespeare, William, "The Merry Wives of Windsor"
11 "Orlando Furioso"
10 Anticipations
10 Chamberlain's Company
10 Danter, John
10 Shakespeare, William "3 Henry VI"
9 "3 Henry VI"
9 "Merry Wives of Windsor, The"
9 Peele, George "The Old Wives' Tale"
9 Peele, George "Edward I"
9 Repetitions



500
Ryals, Clyde de L.                  PR5508.R9 1964
Theme and Symbol in Tennyson's Poems to 1850
Sum= 322 / 29.54 = 11

37 Keats, John
23 "In Memoriam"
22 William Wordsworth
20 "Two Voices, The"
17 "Palace of Art, The"
15 "Lotus Eaters, The"
15 "Ulyssus"
13 "Mariana"
13 "Recollections of the Arabian Nights"
12 Hallam, Arthur Henry



Kock, Ernst Albin, ed.
Den Norsk-Isländska Skaldediktningen
Sum= 626 / 29.54 = 21

2 13 j ark: BjarkamAl, anon. 

2 Danir. Danir, anon. 

2 Filing: FinngAlkn, anon. 

2 J6m.svikingar anon. 

2 Karlcvi: Karlovis tenens drottkvildade vers, anon. 
2 Oddm. : Oddmjor anon. 

2 RauSsk.’: RnuSskeggr. anon. 

2 Sveinn cjuguskogg, anon. 

2 Svtjilig: Sveinn tjuguskegg, anon. 

2 T&ngbrand och Gudlev, diki om, anon. 

2 Vagn; Vagn Akason anon. 

2 AEvidripa (orvar-odds) ; ur Qrvar-odds saga 



525 

PT7244.K57 1946-49 








550
Ostrowski, Alexander                QA303.O88 v3 1945-54
Vorlesungen über Differential- und Integralrechnung
Sum= 868 / 29.54 = 29

14 Cauchy, A.
9 Euler
6 Cantor, G.
6 Dirichlet
6 Gauss
6 Hardy, G.H.
6 Weierstrass
5 Abel
5 Ellipse
5 Hermite
5 Konvergenzkriterien für uneigentliche Integrale
5 Schwarz, H.A.
4 Bertrand, J.
4 Bolzano
4 Cesàro
4 Hausdorff
4 Jensen, J.L.W.V.
4 Newton
4 Pringsheim, A.
4 Riemann, B.
3 Carathéodory
3 Cauchy-Bolzanosches Konvergenzkriterium
3 Chaundy
3 Enveloppe
3 Fresnelsche Integrale
3 Hadamard
3 Inhalt
3 Konvergenzkriterium für unendliche Cauchy-Bolzanosches
3 Poisson
3 Stieltjes
3 Vergleichskriterium für unendliche -uneigentliche Integrale
3 Zusammenhängend



575
Soule, Byron Avery                  QD9.S71 Ref. 1938
Library Guide for the Chemist
Sum= 833 / 29.54 = 28

5 Gregory
4 Böttger, Wm.
4 Furman, N.H.
4 Water, analysis of
3 Biography, German
3 Browne, C.A.
3 Classen
3 Daniels, F.
3 Dyes, patents on
3 Ferro-alloys, analysis
3 Findlay, A.
3 Glasstone, S.
3 Hahn, D.
3 Hall, W.T.
3 Houben, J.
3 Indexes, patent
3 Koltoff, I.M.
3 Martin, G.
3 Meyer, R.J.
3 Mnemonics
3 Nomenclature, organic
3 Organo-metallic Compounds
3 Ostwald, Wm.
3 Patents, dye
3 Rossman, J.
3 Steel, analysis of
3 Sugar, analysis
3 Thorpe, Edw.
3 Weiser, H.B.
3 Worden, E.C.






600
Varley, Ernest Reginald             QE391.S5 V3 1965
Sillimanite
Sum= 729 / 29.54 = 25

11 Reserves, India
10 Andalusite: U.S.A.
8 Kenya
8 Reserves, U.S.A.
7 Assam, India
7 United States
6 Florida, U.S.A.
6 Georgia, U.S.A.
5 Beneficiation, U.S.A.
5 Bihar, India
5 Brazil
5 California, U.S.A.
5 Dumortierite, U.S.A.
5 Mysore, India
5 Nyasaland
5 South Africa, Republic of
5 Topaz: U.S.A.
4 Aluminium industry
4 Andalusite: U.S.S.R.
4 Andhra Pradesh, India
4 Baker Mountain, Virginia
4 Density
4 Graves Mountain, Georgia
4 Henry Knob, S. Carolina
4 India
4 Kerala, India
4 Kyanite density
4 Lapsa Buru, Bihar
4 Madhya Pradesh, India
4 Maharashtra, India
4 Nevada, U.S.A.
4 New South Wales, Australia
4 Orissa, India
4 Reserves, U.S.S.R.
4 Sillimanite minerals: density
4 South Carolina U.S.A.
4 Transvaal, South Africa
4 United States, National Stockpile Purchase Specification



625
Davis, David Edward                 QL703.D3 1963
Principles in Mammalogy
Sum= 836 / 29.54 = 28

11 Carnivores
10 Bat(s)
10 Insectivores
10 Woodchucks
9 Marsupials
9 Mutation
9 Teeth
8 Monotremes
8 Whale(s)
7 Herbivores
7 Opossums
7 European rabbit(s)
7 Raccoons
6 Dispersal
6 Maintenance
6 Predators
6 Primates
6 Shrew(s)
6 Vole, meadow
5 Body size, temperature
5 Camels
5 Competition
5 Corpora lutea
5 Elephants
5 Feedback
5 Food
5 Fossils
5 Migration
5 Mole
5 Muskrats
5 Nearctic region
5 Omnivores
5 Oriental region
5 Sex ratio
5 Squirrel(s), ground
5 Temperature




650
Ewerhardt, Frank Henry              RM721.E8 1947
Therapeutic Exercise
Sum= 338 / 29.54 = 11

8 Muscle contraction
8 Paralysis
8 Posture
7 Spastic paralysis, exercise in
6 Flat foot
6 Muscle function volitional tests
6 Poliomyelitis treatment
6 Re-education
5 Hemiplegia
5 Lordosis
5 Poliomyelitis testing by topographical observations
5 Poliomyelitis treatment during acute stage
5 Re-education of upper extremity
5 Scoliosis
5 Upper extremity re-education
5 Zero position



675
Underhill, Charles Reginald         TK153.U5 1933
Electrons at Work
Sum= 3827 / 29.54^2 = 4

9 Tube, gaseous-discharge
7 Hertz
7 Light, ultra-violet
7 Maxwell
7 Reaction, Reactors
7 Valence electrons






700
Stratman, Carl Joseph               Z5782.A258 Ref. 1954
Bibliography of Medieval Drama
Sum= 2797 / 29.54^2 = 3.2

44 Passion
37 Comedy
32 Latin
31 Staging




APPENDIX II 



PAGE REFERENCE DISTRIBUTIONS 
FROM 

THE FONDREN INDEX SAMPLE 






[Figures: page reference distributions for books in the
Fondren Index Sample, plotted on logarithmic scales. Axes in
each panel: Number of Page References; Number of Index
Entries. Legible panel labels: BV2532.M3 1921; D443.C6 1927;
DA690.K7 1922; HB199.W595 1960; PT7244.K57 1946-49;
QA303.O88 v3 1945-54.]



APPENDIX III 



AMALGAMATED ALGORITHMIC INDEX TO
ABSTRACTS IN STATISTICS






INDEX TO ABSTRACTS 




ABSiJL'JIH CLN1t4AL HGiCNFf 40- 
0721 

AOSm’.JTK jlntaal moment of 
imnc-rt, 40-0 721 
ABSJLDTfc CONTINUOUS 
oI ST.< iBUTIUNt 4C-07ifT 
AOSiTLUTk t)IFFFr<E^'.Ef 40-0724 
AttSiJLUTE FKLOUtNClieS UF 
ATTXIUUTtS, 33-1480 
AUUiriVITV OF EFFECTS, 34- 
U3!>7 

AOUIIIVITY OF MAIN EFFECTS, 
JA-OaST 
AOMISSIBILITV 

PHOoF of, 40-1859 
ALMOST invariant COVFIOENCE 
*»ROCfcOURES, 33-1480 
ALMOST NEGLIGIDLF, 54-0355 
MTcRNATIVe UbFINITIJN, 41- 
U329 

ALTERNATIVE PkO?F, 40-2219 
ALTi_.UMTI VES 

PARAMETRIC CLASSES OF, 41- 
0329 

ScOUENCE OF, 40-0722 
ANALOGOUS DISTANCE, 40-2219 

analysis 

VALIDITY i;F, 34-0557 
ANALYSIS OF FATIGiF LIFE, 

35- ISO/ 

AkA»vi>is it inf J5 NM ION, 40- 
0724 

analysis of variance 

IfilNTIlY, i5-I4«0 
A>IALYSIS OF variance TESTS, 
»4-0 5S 7 

ANjLOMOF.'j (Hff)RTH 

• XI. NSIDN l?l , 40-2217 
APPi IL A I I.iNS ii( w, aK 
. ONVf. KOI NLT , 4J-22I7 
\>'»^R'IPRI A II UFURMS OF 
• RK'iKjM, 40-0 721 

I NATI iNTtRVALS, 40- 
J 

i.'imnjxIMATI PnwtKS, 40-0722 
A 1 VAL liATLM, 41 -0176 
•* V SLALI , 54-OiSS 
A »M}L lAT lON 

Ml A SURi S l U , 35-1 4 ‘*0 
A;>.0MPTMN dF . OiJAL, 40-1857 
ASYMPTUTIC OlHAVlUR, 33- 
1 v*12 , 4f>— 18S6 

asymptotic comparisons, 40- 

ASYMPTOTIC L.‘VARIaNCE 
HATHIXy 40-1 BSB 
asymptotic OISTRinUTlON, 40- 
d 725, 40-185B 
ASYMPTOTIC OISTRiliUTION 
theory, 40-0721 
ASYMPTOTIC EXPANSIONS, 40- 
0722 

asymptotic normality, 41- 
0329 

ASYMPTOTIC NORMALITY UF 

maximum hkelihooo 

o 




FPn>FH VeiTiVnna ep tITBI ABmAl>9 WA'PWfliMATTCAfc STATfSTfC# 



.ASYNPTUTH: riPTlMAUTY, 40- 
1 US 7 

ASYMPT.M 1C MPTI nUM 
PKiipi RT I { S • 40-0722 
ASVMTMOTIw PROPERTIES, 34- 
035S, <,o-iHsr 
ASYMPTOTIC lESTS, 40-0720 
ASYMPTOTIC THfORF.HS 

OFRIVftTinN JF, 40-2217 
ASYMPTOTIC THEOREMS IN 
statistics, 40-2217 
ASYMPTOTIC THcORY DF LINEAR 
COMBINATIONS, 40-2217 

attribute sequences, 33-14H0 
attributes 

absolute fREOUFNCIES OF, 
33-14,80 

AUTHOR USES, 34-0358 
AUXILIARY VARIATE, 40-2219 
AVERAGE OF UISTlNCT UNITS, 
34-0357 

AVERAGE «ISK C<lTEAIfJN, 40- 
2219 

average sample sue, 40-2210 

B 

BAHADUR EFFICIENCY, 40-1858
BARTLETT'S TEST, 40-0722
BASIS OF SAMPLES, 40-2216
BATCH ARRIVAL DISTRIBUTION,
41-0328

BATCH SERVICE, 41-0328
BAYES RULE, 40-0723

BESSEL FUNCTIONS OF MATRIX

ARGUMENT, 34-0358
BEST INVARIANT, 40-1856
BIVARIATE CASE, 40-0724
BIVARIATE GAUSSIAN DENSITY
FUNCTION, 40-0724
BLYTH

METHODS OF, 40-1859
BODY OF NORMAL-THEORY
TECHNIQUES, 40-1858
liOKl L ^-ALGFBkA, 41-0330 
BOUNDS

EXISTENCE OF, 41-0329
BOX'S I OLA, 4t)-lB4B 

C 

CANONICAL CORRELATIONS, 40- 
0724 

CASE OF PHASE SERVICE TIME
DISTRIBUTION, 41-0328
CAUCHY DISTRIBUTION, 40-0723
CELL ENTRIES, 40-0724
CELL MEANS, 34-0357
CHARACTERISTIC FEATURE, 33-
1480

CHARACTERISTIC FUNCTION, 40-
1860

CHARACTERISTIC LIFE, 33-1502
CHARACTERIZATIONS OF

SYMMETRIC STABLE PROCESSES,
40-1859

CHARACTERIZE DEPARTURES, 34-
0357



CHERNOFF-SAVAGE STATISTICS,
40-2217

CHI-SQUARE DISTRIBUTION, 40-
0721

CHI-SQUARE VARIATE, 40-1860
CLASS OF CONJUGATE PRIOR
DISTRIBUTIONS, 41-0328
CLASS OF CONJUGATE PRIORS,
41-0328

CLASS OF NON-PARAMETRIC
ALTERNATIVES, 41-0329
CLASS OF NORMAL
DISTRIBUTIONS, 33-1506
CLASS OF PROCESSES, 40-1856
CLASS OF SEQUENTIAL
PROCEDURES, 40-2216
CLASS OF STOCHASTIC
PROCESSES, 40-1856
CLASSES OF DISTRIBUTIONS,

33-1506

CLASSES OF ESTIMATORS, 41- 
0329 

CLASSIFICATION PROCEDURES,

34-0358

COEFFICIENTS OF VARIATION,
40-0723

COLUMN VECTORS, 34-0356
COMBINATORIAL METHODS, 40-
1856

COMPACT SETS, 41-0330
COMPARATIVE STUDY, 34-0355
COMPARE BARTLETT'S TEST, 40-
0722

COMPLETE INVARIANT, 40-2219
COMPLETE SAMPLES, 40-0723
COMPLEX CASE, 40-0721
COMPONENTS IN MODEL II
ANOVA, 40-0720
COMPOUND POISSON PROCESS,
40-1856

COMPUTATIONAL EXPENDITURES,
40-2216

CONCEPTS OF PITMAN
EFFICIENCY, 40-1858
CONDITIONAL DISTRIBUTION,
40-2217

CONDITIONS OF WALD, 40-1857
CONFIDENCE INTERVAL, 40-
0720, 40-1860
CONFIDENCE PROCEDURES
MEASURABLE, 33-1480
CONFIDENCE REGION, 33-1480,
40-2216, 40-2217

CONFIDENCE SETS, 40-1860

CONJUGATE PRIOR
DISTRIBUTIONS
CLASS OF, 41-0328
CONJUGATE PRIORS
CLASS OF, 41-0328
CONSERVATIVE NONPARAMETRIC
TEST, 40-1857

CONSISTENCY PROPERTIES, 40-
1856

CONSTANT MULTIPLE, 40-1859
CONSTANT SCATTER, 40-1859
CONSTANT TIMES, 40-1860
CONTINGENCY TABLE, 33-1480,
40-0724

CONTINUOUS MEASURABLE
FUNCTION, 41-0329



CONTINUOUS CDF'S, 34-0357
CONTINUOUS FUNCTION, 40-
0722, 40-1859

CONTINUOUS POPULATIONS, 40-
2216

CONTINUOUS PROBABILITY
DENSITY FUNCTION, 34-0355
CONTINUOUS TIME VERSION OF
FELLER'S COMBINATORIAL
LEMMA, 40-1856
CONVENTIONAL HISTOGRAM, 40-
1856

CONVERGENCE

RAPIDITY OF, 40-0723
RATE OF, 40-0722
CONVERGENCE IN DISTRIBUTION,
41-0330

CONVERGENCE IN PROBABILITY,
40-1859

CONVEX LOSS FUNCTION, 40-
1859

CONVOLUTION TECHNIQUES, 40-
0721

CORRELATION COEFFICIENT, 40-
0720, 40-0724, 40-1859
CORRELATION PROBLEM, 41-0329
COVARIANCE

UNIVARIATE ANALYSES OF,
40-0721

COVARIANCE MATRICES, 40-1859
EQUALITY OF, 40-1858
STRUCTURE OF, 40-1858
COVARIANCE MATRIX, 34-0358,
40-0722, 40-1856, 40-1857,
40-2216

COVERAGE PROBABILITY, 40-
1857

D 

OAM MlltlEi , 40-1856 
DECIMAL all S, 40-0723 
DECISION FUNCTIONS, 40-0722
DECISION RULES, 34-0355
DEFICIENCIES W.R.T, 40-2219
DEFINITION OF LINEAR
SUFFICIENCY, 41-0329
DEGREES OF FREEDOM, 40-0720,
40-0722

DEPENDENT GAUSSIAN VECTORS,
40-0724

DEPENDENT VARIABLE
REPRESENTS DEVIATIONS, 34-
0357

DERIVATION OF ASYMPTOTIC
THEOREMS, 40-2217
DES RAJ, 34-0357
DESIGN MATRIX, 40-0722
DIAGRAM OF SERIAL
ASSOCIATION, 33-1480
DIFFERENCE ESTIMATOR, 40-
2219

DIMENSIONAL DISTRIBUTION,
34-0355

DIMENSIONAL VECTOR, 34-0358
DIMENSIONS

FLAT SPACE OF, 40-0720
DISCRETE RANDOM VARIABLES
SET OF, 40-2217






.)1S»»0INT INftKVAlSt A0-??1V 
DISTANCES

ESTIMATORS OF, 40-0722
DISTINCT UNITS

AVERAGE OF, 34-0357
DISTRIBUTION

CONVERGENCE IN, 41-0330

RANDOM VARIABLE INVARIANT
IN, 40-2216

;)l b fR fHUr IllN (^UMCriJN* AO- 
HZ.Mt AO-UIS/, «*0-?ZlA, 40- 
4l-03<’‘i 

DISTRIBUTION-FREE

INFERENCES, 34-0355
DISTRIBUTION-FREE TOLERANCE-
LIMIT TABLES

EXTENSIVE SET OF, 34-0355
DISTRIBUTIONS

CLASSES OF, 33-1506
DISTRIBUTIONS OF POINTS, 33-
1506

DOUBLE EXPONENTIAL
DISTRIBUTION, 40-0723
DRAWS

NUMBER OF,

E 

EASTERN CONNECTICUT STATE
COLLEGE, 40-2217
EFFECTS

ADDITIVITY OF, 34-0357
EMPIRICAL PROCESS, 40-2217
EMPIRICAL PROCESSES, 41-0330
EQUAL

ASSUMPTION OF, 40-1857
EQUAL DISTRIBUTIONS

HYPOTHESIS OF, 34-0355
EQUAL SIZE

SUCCESSIVE CLUSTERS OF,
33-1480
EQUALITY

HYPOTHESIS OF, 40-1858
EQUALITY OF COVARIANCE
MATRICES, 40-1858
EQUALITY OF VARIANCES, 40-
1858

L'RKUR LOSS FUNCTION, 40-1854 
MUOEL, 40-2216 

ERRORS OF MISCLASSIFICATION,
40-2216

ERRORS OF PREDICTIONS, 34-
0358

ESTIMABLE FUNCTION, 40-0721
ESTIMATORS

CLASSES OF, 41-0329
ESTIMATORS OF DISTANCES, 40-
0722

LliCLIOFAN N-SPACl, 40-2210, 
40-221 V 

EXACT CONFIDENCE INTERVALS,
40-1860

EXISTENCE OF BOUNDS, 41-0329
EXPECTATION IN THEOREM, 41-
0328

EXPONENTIAL DISTRIBUTION,
40-2219

EXPONENTIAL FAMILIES, 40-
1860

EXPONENTIAL RANDOM VARIABLES

RANDOM SUM OF, 41-0328
EXTENSION OF ANSCOMBE

THEOREM, 40-2217
EXTENSIVE SET OF
DISTRIBUTION-FREE
TOLERANCE-LIMIT TABLES, 34-
0355

EXTREME CASE, 33-1480

F 

FACTORS

MAXIMUM NUMBER OF, 40-
0720, 40-0723, 40-2217,
40-2218

FAILURE TIMES, 33-1502
FAMILIES OF FINITE MEASURES,
40-2219

FAMILY OF PROBABILITY
DENSITIES, 40-0722
FAMILY OF PROBABILITY
SPACES, 40-0722




FATIGUE LIFE

ANALYSIS OF, 33-1502
FELLER'S COMBINATORIAL LEMMA
CONTINUOUS TIME VERSION
OF, 40-1856

FINDINGS OF HAJEK, 40-1857
FINITE MEASURE, 40-0722
FINITE MEASURE FIELD, 33-
1480

FINITE MEASURES

FAMILIES OF, 40-2219
FINITE MOMENTS, 40-0721
FINITE NUMBER, 40-2216, 40-
2217

FINITE POPULATION, 40-2217,
40-2218, 40-2219
FINITE PROJECTIVE GEOMETRY,
40-0723

POINTS IN, 40-0720
FINITE PROJECTIVE SPACE
POINTS IN, 40-2217, 40-
2218

FINITE SEQUENCES

RANDOMNESS IN, 33-1480
FISHER-YATES TEST, 41-0329
FIXED-SAMPLE PROCEDURE, 40-
2216

FLAT SPACE OF DIMENSIONS,
40-0720
FORM

FUNCTIONS OF, 40-1860
HYPOTHESES OF, 40-1860
FORM OF GUESS, 40-0722
FORM REPRESENTATIONS, 40-
0721

FORMAL PROCESS, 40-1860
FORMAL SOLUTIONS, 34-0358
FORMULAE OF INTRACLASS
CORRELATION, 33-1480
FREEDOM

APPROPRIATE DEGREES OF,
40-0721

DEGREES OF, 40-0720, 40-
0722

HYPOTHESIS DEGREES OF, 40-
0721

N-2 DEGREES OF, 40-2220
FULL SET, 40-n5n 
FUNCTIONS OF FORM, 40-1860
FUTURE OBSERVATION, 34-0356

G 

GALOIS FIELD, 40-0720, 40-
2217, 40-2218
GALOIS r-lE.LD GFCS, 40-0723 
GALTUN lt=ST, 40-2217 
r.AMMA TL.Ti 34-0355 
GAUSS-MARKOV ESTIMATE, 40-
1860

GAUSS-MARKOV THEOREM, 40-
0723

GAUSSIAN PROCESSES, 41-0330
GENERALITY

LOSS OF, 40-1859

GODAMBE'S DEFINITION OF
LINEAR SUFFICIENCY, 41-0329
GRAND MEAN PLUS, 34-0357
GUESS

FORM OF, 40-0722



H 

HAGA TEST, 40-2216 
HAJEK 

FINDINGS OF, 40-1857 
HAND COMPUTATION, 40-1857 
HAROLD CRAMER VOLUME, 40-
1856

HARIKH-LUM TECHNIQUES* 34- 
035 7 

HIGHEST ORDER INTERACTION,
34-0357
HISTOGRAMS

TYPES OF, 40-1856
HOGBEN ET AL, 40-0720
HOMOGENEITY

TESTS OF, 40-0724
HOMOGENEITY OF VARIANCES,
40-0722

HOMOGENEOUS STOCHASTIC
PROCESS, 40-1859




HORVITZ-THOMPSON ESTIMATOR,
40-2210, 41-0329
HYPERGEOMETRIC DENSITY, 40-
0720

HYPOTHESES

TESTS OF, 34-0358
HYPOTHESES OF FORM, 40-1860
HYPOTHESIS DEGREES OF
FREEDOM, 40-0721

HYPOTHESIS OF EQUAL

DISTRIBUTIONS, 34-0355
HYPOTHESIS OF EQUALITY, 40-
1858

HYPOTHESIS OF MARGINAL
HOMOGENEITY, 40-0724



IDEA OF KATTI, 40-0722
IDENTICAL SYMMETRIC
DENSITIES

TRANSLATION PARAMETERS OF,
40-1859
IDLE SERVERS

NUMBER OF, 41-0328
INADMISSIBLE ESTIMATOR, 40- 
1350 



INCLUSION PROBABILITIES, 40-
2210

SEQUENCE OF, 40-2216
INCLUSION PROBABILITY, 40-
2213

INCREASE

POINT OF, 34-0355

INDEPENDENCE OF SETS, 40-
1858

INDEPENDENT COLLECTION, 34-

0355

INDEPENDENT ESTIMATE, 40-
1860

INDEPENDENT IN PAIRS, 34-

0355



INDEPENDENT INCREMENTS, 40-

1856

INDEPENDENT NORMAL RANDOM
VARIABLES, 40-1859
INDEPENDENT OBSERVATIONS
SEQUENCE OF, 40-2217
INDEPENDENT P-VARIATE NORMAL
POPULATIONS, 40-1856
INDEPENDENT POISSON
PROCESSES, 33-1502
INDEPENDENT PROCESSES, 40-

1856

INDEPENDENT RANDOM SAMPLES,
40-2216

INDEPENDENT RANDOM

VARIABLES, 34-0357, 40-1857
SUMS OF, 41-0329
INDEPENDENT SAMPLE, 34-0350
INDEPENDENT UNOBSERVABLE
RANDOM VARIABLES, 40-1857
INDEPENDENT VARIABLE, 34-
0357, 40-0720
INDICES

TRIPLET OF, 34-0355
INDIVIDUAL CUSTOMER, 40-1856
INDIVIDUAL REGRESSION
COEFFICIENT, 40-1858
INFINITE CASES, 40-0720
INFORMATION

ANALYSIS OF, 40-0724
MEASURES OF, 40-2219
INFORMATION-THEORETIC
INEQUALITY, 40-0724
INITIAL QUEUES OF SIZES, 40-
1856

INTEGERS

PAIR OF, 34-0355
INTEGRANDS SATISFY, 40-1859
INTERARRIVAL DISTRIBUTION,
41-0328

INTERARRIVAL TIMES, 41-0328
INTEREST

PROPERTIES OF, 40-0722
REGION OF, 40-1860
INTRACLASS ASSOCIATION, 33-
1480

INTRACLASS CORRELATION
FORMULAE OF, 33-1480
INVARIANT MARKOV KERNELS,
40-2219

IOWA CITY, 41-0328




J 

JACKKNtX ESTIMA7E* 4( 
JACKKNII f PRflCFDURE, 4 
JACKKNIFE TESTS* 40-18 
JOINT DIStHlDUTION* 34 
40-2216 

JOINT STATISTICAL 
CnHF-rRrNCE, 4I-03?H 



K 

K-OECISION PROOLEMS, 4 
K-OIMENSIONAL UNIT-CUBI 
0330 

K-STATIsr ICS 

MULTI PLICATION OF, 4( 
KATTI 

IDEA OF* 40-0722 
KOLMOGOROV TEST STATIS1 
41-0330 

KOLMOGOROV-SMIRNOV HS1 
2216 

L 

LACK OF RANDOMNESS, 33- 
LADDER PROCESS, 40-1856 
LAOOL-R PROCESSES, 4‘^18 
LARGE NUMBERS 
LAW OF* 40-2216 
LARGER MEAN, 40-1659 
LATENT ROOTS, 34-0353 
LATIN SQUARE, 34-0355 
LAW OF LARGE NUMBERS, 4< 
2218 

LEAST SQUARES, 33-1502 
LEHMANN'S

VERSION OF, 34-0357
LEHMANN'S TEST, 40-0722 
NULL OlSTRTROriON UF, 
0722 

lexicographic order, 34- 
LIKELIHOOD FUNCflUN, 40- 
LIKDLIHUOO RATIO, 40-072 
LIKELIHOOD RATIO CRITERI 
40-0722 

LIKELIHOOD RATIO CRITERI 
40-0721 

LIKELIHOOD RATIO TEST, 4 

□ 722 

LIKELlHOGD-RAril) TEST, 4 
IB5U 

LIMII THFOREM, 40-1856 
LINEAR COMBINATIONS 

ASYMPTDIIC IHCURY JF , < 
2217 

LINFAR f.()MH I nations (It 01 
STAnSIICS, 40-2217 
liniaw estimators, 41-oj; 

LINEAR lORMS, 34-0 i55 
LINCAP FUNC TION* 40-1 B60 
LIN(-AR MVFiJlHF.SIS PRU»L''H 

40- 0721 

LlNt.AR MUIH L, 14-U35 7, 40 
0722, 40-1857 40-222U 

LINEAR HfGRESSiUN, 40-185 
LINEAR SUFElf.llNCY* 4 1-01 
mt INIIIUN UP, 4 1-0329 
GnOAMUE'S OLE INI I lUN OF 
41-0329 

linear suff icient ESTIMAT 

41- 0329 

LINEAUI1Y VERSUS CUNVEXI7 
RANK TEST UF, 40«'l85r 
LOGISTIC OISTRIBUTIUN, 40 

□ 723 

LOGNORMAL DISTRIBUTION
MEAN OF, 40-1860
LOSS OF GENERALITY, 40-lB 
LOWER BOUND, 41-0329 
LOWER ORDER INTERACIlUN, • 
□720, 40-0723, 40-2217, A 
2218 




M-WAY MARGINAL HOMOGENEITY
QUESTION OF, 40-0724



MAIN EFFECT, 34-0357, 40-
2217
MAIN EFFECTS
ADDITIVITY OF, 34-0357
MAIN PORTION, 40-1857
MANN-WHITNEY-WILCOXON TESTS,
34-0355

MkHUCA MHO DU Ml ^0-i)72l 
MARGINAL DENSITIES, 40-0724
MARGINAL DISTRIBUTION, 34-
0355

MARGINAL HOMOGENEITY

HYPOTHESIS OF, 40-0724
PROBLEM OF,
TESTS OF, 40-0724

MARKOV KERNEL CRITERION, 40-
2219

MATRIX ARGUMENT

BESSEL FUNCTIONS OF,

34-0358

MAXIMAL OlSIANCtf 40-22H 
MAXIMUM UlAMKUU. AO-2217 
MAXIMUM INFORMATION
EXPERIMENT, 40-2219
MAXIMUM LIKELIHOOD, 40-1856
METHOD OF, 33-1502

MAXIMUM LIKELIHOOD ESTIMATE,
34-0358

MAXIMUM LIKELIHOOD ESTIMATES
ASYMPTOTIC NORMALITY OF,
40-2217

MAXIMUM LIKELIHOOD
ESTIMATION, 33-1502
MAXIMUM LIKELIHOOD
HISTOGRAMS, 40-1856

MAXIMUM NUMBER OF FACTORS,
40-0720, 40-0723, 40-2217,
40-2218

MAXIMUM NUMBER OF POINTS,
40-0720, 40-0723, 40-2217,
40-2218

MCGILL UNIVERSITY, 40-2218,
40-2220

MEAN OF LOGNORMAL
DISTRIBUTION, 40-1860
H t A N V fc L T 7 H , 4^ 

MEASURES

SEQUENCE OF, 41-0330

MEASURES OF ASSOCIATION, 33-

1480

MEASURES OF INFORMATION, 40-

MEDIAN TEST, 34-0355, 40-
2216

MEHLER'S IDENTITY, 40-0724
MELLIN TRANSFORMS, 40-1860
METHOD OF MAXIMUM
LIKELIHOOD, 33-1502
METHODS OF BLYTH, 40-1859
MICHIGAN STATE UNIVERSITY,
40-0723

MINIMAX CONFIDENCE
PROCEDURES, 33-1480
MINIMAX RISK CRITERION, 40-
2219

MINIMUM DISCRIMINATION
INFORMATION ESTIMATION
PRINCIPLE OF, 40-0724
MINIMUM DISCRIMINATION
INFORMATION STATISTIC, 40-
0724

MINIMUM MEAN SQUARE
ESTIMATOR, 41-0329
MINIMUM NUMBER, 40-2217
MINIMUM VARIANCE ESTIMATOR,
40-0721

MIRROR IMAGES, 34-0355
MISCLASSIFICATION
ERRORS OF, 40-2216
MODEL II ANOVA

COMPONENTS IN, 40-0720
MONOTONIC FUNCTION, 34-0357
MONTE CARLO, 33-1502
MONTE CARLO POWER
COMPARISONS, 40-1857
MULTI-COMPONENT STRUCTURES
RELIABILITY FUNCTIONS OF,
33-1480

MULTIDIMENSIONAL CONTINGENCY
TABLE, 40-0724

MULTIPLE REGRESSION, 34-0357
MULTIPLE REGRESSION OF
NONADDITIVITY, 34-0357




MULTIPLICATION ALGORITHM, 40-
1857
MULTIPLICATION OF K-STATISTICS,
40-1857
MULTIVARIATE CASE, 40-
MULTIVARIATE FAMILY,
MULTIVARIATE NORMAL L
HYPOTHESIS, 34-0358
MULTIVARIATE REGRESSI
0356
MULTIVARIATE STABLE



NUMEROUS EXAMPLES, 40-1860



DISTRIBUTIONS, 40-1860
MULTIVARIATE STATISTICAL
ANALYSIS, 40-1860
MULTIVARIATE STATISTICS
NONCENTRAL DISTRIBUTION
PROBLEMS IN, 34-0358



N 



N-2 DEGREES OF FREEDOM, 40-
2220

NEYMAN ALLOCATION FORMULA,
41-0328

NON RANDOM, 33-1480
NON-CENTRAL BETA VARIABLE,
40-0720

NON-IDENTITY TRANSFORMATIONS
NUMBER OF, 40-2216
NON-NEGATIVE CONSTANT, 33-
1480

NON-NEGATIVE DEFINITE, 40-
1860

NON-PARAMETRIC ALTERNATIVES
CLASS OF, 41-0329
NON-STEADY STATE LAPLACE
TRANSFORMS, 40-1856
NON-STOCHASTIC PREDICTORS,
40- 185 K 
NONADDITIVITY

MULTIPLE REGRESSION OF,

34-0357

NONCENTRAL DISTRIBUTION

PROBLEMS IN MULTIVARIATE

STATISTICS, 34-0358
NONCENTRAL DISTRIBUTIONS,
34-0358

NONCENTRAL MULTIVARIATE BETA

DISTRIBUTION, 34-0358
NONNULL DISTRIBUTIONS, 40-
0722

NONPARAMETRIC ALTERNATIVE,
34-0355, 34-0357

NONPARAMETRIC TESTS, 40-2216
NORMAL ALTERNATIVES, 34-0355
NORMAL CASE, 40-1858
NORMAL DISTRIBUTION, 40-
0722, 40-0723, 40-1857, 40-
1860

NORMAL DISTRIBUTIONS
CLASS OF, 33-1506
NORMAL MEANS, 40-1859
NORMAL POPULATIONS, 40-0722
NORMAL RANDOM VARIABLE, 40-
1860

NORMAL THEORY LIKELIHOOD
RATIO STATISTIC, 40-0721
NORMAL THEORY LIKELIHOOD
RATIO TEST STATISTIC, 40-
0721

NORMAL THEORY T-TEST, 40-
1857

NORMAL-THEORY TECHNIQUES
BODY OF, 40-1858
ROBUSTNESS OF, 40-1858
NULL DISTRIBUTION OF
LEHMANN'S TEST, 40-0722
NULL DISTRIBUTIONS OF WILKS,
40-0721

NULL HYPOTHESIS, 40-0722 



NUMBER OF DRAWS
NUMBER OF IDLE SERVERS, 41-
0328
NUMBER OF NON-IDENTITY
TRANSFORMATIONS, 40-2216
NUMBER OF SERVERS, 41-0328
NUMBER OF SUCCESSES, 40-0720
NUMBER OF VARIATES, 40-0721
NUMBER SUCCESSES, 40-0720
NUMERICAL COMPARISONS, 34-
0357

NUMERICAL EXAMPLES, 40-0722 



ONE-DIMENSIONAL EMPIRICAL
PROCESS CONVERGE, 41-0330
ONTO ITSELF, 33-1480
OPERATIONAL CHARACTERISTICS,

40-2219

OPTIMAL ALLOCATION, 41-0328
OPTIMAL ALLOCATION PROBLEMS,

41-0328

OPTIMAL DESIGN, 40-1858
OPTIMAL HISTOGRAM, 40-1856
OPTIMAL STRATIFIED SAMPLING,
41-0328

OPTIMUM ALLOCATION, 40-2219
OPTIMUM BEST LINEAR, 40-0723
OPTIMUM BLUE'S, 40-0723
OPTIMUM NON-PARAMETRIC
STATISTICS, 40-0722
ORDER

ABSOLUTE CENTRAL MOMENT
OF, 40-0721

ORDER ABSOLUTE CENTRAL
MOMENT, 40-0721
ORDER STATISTICS, 40-0723
LINEAR COMBINATIONS OF,
40-2217

SET OF, 40-0723 
OVERALL AVERAGE OF SUBSAMPLE 
MEANS, 34-0357 






P-DIMENSIONAL SPACE, 40-2217
P-DIMENSIONAL VARIATE, 41-
0328

P-VARIATE DISTRIBUTION, 40-
1857

P-VARIATE NORMAL
POPULATIONS, 40-2216
PAIR OF INTEGERS, 34-0355
PAIRS

INDEPENDENT IN, 34-0355
PAPER GENERALIZES, 34-0355
PAPER TREATS, 40-2219
PARAMETER SET, 40-2219
PARAMETRIC CLASSES OF
ALTERNATIVES, 41-0329
PAST OBSERVATIONS AVAILABLE,
34-0356

PHASE DISTRIBUTION, 41-0328
PHASE SERVICE TIME
DISTRIBUTION
CASE OF, 41-0328
PITMAN EFFICIENCY, 40-1857
CONCEPTS OF, 40-1858
PITMAN ESTIMATOR, 40-1859
POINT OF INCREASE, 34-0355
POINT OF VIEW, 41-0328
POINTS

DISTRIBUTIONS OF, 33-1506
MAXIMUM NUMBER OF, 40-
0720, 40-0723, 40-2217,
40-2218

POINTS IN FINITE PROJECTIVE

GEOMETRY, 40-0720
POINTS IN FINITE PROJECTIVE
SPACE, 40-2217, 40-2218
POPULATION MEAN, 34-0357,

40-2218

POPULATION SIZE, 40-0720
POPULATION TOTAL, 40-2219,

41-0329

POPULATION VARIANCE, 34-0357
POPULATION VECTOR, 40-2219
POPULATIONS CORRESPOND, 40-
1859

POSITION INTERMEDIATE, 34-
0355

POSITIVE CONSTANT, 40-1859
POSITIVE DEFINITE, 34-0358
POSITIVE DEFINITE MATRIX,
40-1856

POSITIVE NUMBER, 40-2216
POSITIVE SOLUTION, 40-1859
POSTERIOR COVARIANCE, 41-
0328

POWER FUNCTION, 34-0355, 34-
0357

POWER FUNCTIONS OF TWO-
SAMPLE RANK TESTS, 34-0355



PRE-EMPTIVE RESUME PRIORITY
SERVICE DISCIPLINE, 33-1502

PREDICTION

ERRORS OF, 34-0358
PRELIMINARY REPORT, 40-2220
PRELIMINARY TEST, 40-2220
PRINCIPLE OF MINIMUM

DISCRIMINATION INFORMATION
ESTIMATION, 40-0724
PRIORI KNOWLEDGE, 40-0722
PRIORITY LEVEL, 33-1502
PROBABILISTIC CONVERGENCE,
40-2218

PROBABILISTIC PSEUDO-METRIC
SPACE, 40-0722
PROBABILITY

CONVERGENCE IN, 40-1859
THEORY OF, 34-0355
PROBABILITY DENSITIES

FAMILY OF, 40-0722

PROBABILITY DENSITY, 40-1859
PROBABILITY FIELDS, 33-1480
PROBABILITY MEASURES, 40-
2219

PROBABILITY OF RANK ORDERS,
34-0357

PROBABILITY SPACES
FAMILY OF, 40-0722
PROBLEM OF MARGINAL
HOMOGENEITY, 40-0724
PROBLEM OF SYMMETRY, 41-0329
PROCESSES

CLASS OF, 40-1856
PRODUCT DISTRIBUTION, 34-
0355

PRODUCT MEASURE, 40-2218,
40-2219

PRODUCT PROBABILITY
MEASURES, 41-0329
PRODUCT SPACE, 33-1480
PROOF OF ADMISSIBILITY, 40-
1859

PROPERTIES OF INTEREST, 40-
0722

PROPORTION OF SUCCESSES, 40-
0720

PSI TEST, 34-0355
PTH ABSOLUTE CENTRAL MOMENT,
40-0721




QUADRATIC MEAN, 40-1859
QUANTILE PROCESS, 40-2217
QUANTITATIVE SITUATIONS, 33-
1480

QUESTION OF M-WAY MARGINAL
HOMOGENEITY, 40-0724
QUEUE SIZES, 33-1502

R 

RANDOM MATRICES, 34-0358
RANDOM OBSERVATION, 40-2217
RANDOM SAMPLE, 34-0355, 40-
0720

RANDOM SAMPLE OF SIZE, 33-
1502, 40-0723
RANDOM SUM OF EXPONENTIAL
RANDOM VARIABLES, 41-0328
RANDOM VARIABLE INVARIANT IN
DISTRIBUTION, 40-2216
RANDOM VARIABLES, 40-0720,
40-0722, 40-1856, 40-1859
RANDOM VECTOR, 40-0722
RANDOMNESS

LACK OF, 33-1480
RANDOMNESS IN FINITE
SEQUENCES, 33-1480
RANK ORDER, 34-0357
RANK ORDER TESTS, 40-1857
RANK ORDER TESTS STATISTICS,
40-0721

RANK ORDERS

PROBABILITY OF, 34-0357
RANK STATISTIC, 40-1857, 41-
0329

RANK TEST, 34-0355, 34-0357,
40-1857, 40-2216, 41-0329
RANK TEST OF LINEARITY
VERSUS CONVEXITY, 40-1857
RANK-ORDER TEST, 41-0329



RAPIDITY OF CONVERGENCE, 40-
0723

RATE OF CONVERGENCE, 40-0722
REAL CASE, 40-0721
*<EAL LEBhSGlJF A»Ef 41-0329 
REAL LINE, 40-0720
REAL MULTIVARIATE

CHARACTERISTIC FUNCTION
EXP, 40-1860

REAL VARIATE VALUE, 40-2218
REASONABLE CRITERION, 40-
0722

RECTANGULAR DISTRIBUTION,

40-0721

M’LlAvNGUlAK VAK8AUE1S* 34- 
01t'> 

H t-FLl-i PR ?NCf PL I * 40- 
14*»6 

REGION OF INTEREST, 40-1860

REGRESSION CASE, 40-2217
REGRESSION COEFFICIENT, 40-
2220

REGRESSION MODEL, 40-1857
REGRESSION COEFFICIENTS, 34-
0356

REGULARITY CONDITIONS, 40-
2219, 41-0328, 41-0329
RELIABILITY FUNCTIONS OF
MULTI-COMPONENT STRUCTURES,
33-1480

RcPRbSEOTS BATCH ARRIVALS* 

41- 0328 

HUttFmv COEFFICIENTS* 33- 
14 AO 

PROCtOtKFS* 40-ie‘Sd 
ROBUST TESTS, 40-1858
ROBUSTNESS

WELL-KNOWN LACK OF, 40-

1858

ROBUSTNESS OF NORMAL-THEORY
TECHNIQUES, 40-1858
ROW VECTOR, 40-2217
4i;y*S TEST* 40-1858 
RTH ABSOLUTE CENTRAL MOMENT,
40-0721

S

SAMPLE CORRELATION
COEFFICIENT, 40-0720
SAMPLE COVARIANCE MATRIX,

40-1856

SAMPLE EMPIRICAL PROCESS,
40-2217

SAMPLE MEANS, 34-0356
SAMPLE PROBLEMS, 40-2217
SAMPLE SUL* 40-0772* 40- 
in-*7* 40-lH4*»* 40-2n*>v 4l- 
0 529 

SAMPLE SPACE, 33-1480
SAMPLE TEST, 34-0356
SAMPLES

BASIS OF, 40-2216
SAMPLING DESIGN, 40-2218
SAMPLING DESIGNS

SEQUENCE OF, 40-2218
SAMPLING DISTRIBUTIONS

UNIVARIATE ONE-PARAMETER
FAMILY OF, 41-0328
SAMPLING SCHEME, 40-0722
STATISTICAL INFERENCE ASPECT,
40-0722

SCALAR CONSTANTS, 40-1856

SCALE PARAMETERS, 40-0723
SECOND-ORDER MOMENTS, 40-
1858

SEPARABLE COMPLETE METRIC
SPACE, 41-0330
SEQUENCE OF ALTERNATIVES,
40-0722

SEQUENCE OF INCLUSION
PROBABILITIES, 40-2218
SEQUENCE OF INDEPENDENT
OBSERVATIONS, 40-2217

SEQUENCE OF MEASURES, 41-

0330

SEQUENCE OF SAMPLING
DESIGNS, 40-2219
SEQUENTIAL PROCEDURE, 40-
1857

SEQUENTIAL PROCEDURES
CLASS OF, 40-2216
SI O > 40-2216 




SEQUENTIAL SAMPLE SIZE, 40-
1857

SERIAL ASSOCIATION
DIAGRAM OF, 33-1480
SERIES EXPANSION, 40-0724
SERVERAL TYPES OF SYMMETRIC
FUNCTIONS, 40-1857

SERVERS

NUMBER OF, 41-0328
SERVICE OPERATIONS, 40-1856
SERVICE STATIONS, 40-1856
SERVICE TIME DISTRIBUTION,
41-0328

SERVICE TIME DISTRIBUTION
FUNCTION, 33-1502
SERVICE TIMES, 40-1856
SET OF DISCRETE RANDOM
VARIABLES, 40-2217
SET OF ORDER STATISTICS, 40-
0723

SET OF VALUES, 40-0720
SETS

INDEPENDENCE OF, 40-1858
SETS OF VARIATES, 40-1858
SIMPLE CASE, 33-1480
SIMPLE RANDOM SAMPLING, 34-
0357, 40-2219

SIMPLEST APPROACH, 33-1502
SIMULATION TECHNIQUES, 40-
0723

SINGLE SERVER, 33-1502

SINGLE WISHART MATRIX, 34-
0358
SIZE

RANDOM SAMPLE OF, 33-1502,
40-0723
SIZES

INITIAL QUEUES OF, 40-1856
SMALL SAMPLE DISTRIBUTION,
40-0723

SMALL SAMPLE POWER, 40-1857
SMALL SAMPLES, 40-0723
SMALLEST INTEGER, 40-2216
SMALLEST VARIABLE, 34-0357
SPHERICAL CONFIDENCE REGION,
40-1857
SQUARES

SUM OF, 40-0720
STABLE PROCESSES, 40-1859
STANDARD ERROR, 40-1856
STATION JOIN, 40-1856
STATIONARY INDEPENDENT
INCREMENTS, 40-1856
STATIONS LEAVE, 40-1856
STATISTICS

ASYMPTOTIC THEOREMS IN,
40-2217

STEADY STATE, 41-0328
STOCHASTIC INTEGRAL, 40-1859
STOCHASTIC PREDICTORS, 40-
1857

STOCHASTIC PROCESSES, 40-
1859

CLASS OF, 40-1856
WEAK CONVERGENCE OF, 40-
2217

STRAIGHT LINE, 33-1502
STRAIGHT-LINE MODEL, 40-0720
STRAIGHTFORWARD MANNER, 34-
0357

STRUCTURE OF COVARIANCE
MATRICES, 40-1858
STUDENT'S T-STATISTIC, 40-
2220

SUBSAMPLE MEANS

OVERALL AVERAGE OF, 34-
0357

SUBSAMPLE SIZES, 34-0357
SUCCESSES

NUMBER OF, 40-0720
PROPORTION OF, 40-0720
SUCCESSIVE CLUSTERS OF EQUAL
SIZE, 33-1480
SUCCESSIVE VALUES, 33-1480
SUFFICIENT CONDITIONS, 40-
1859

SUITABLE CONDITIONS, 40-0722
SUITABLE GENERALIZATIONS,
40-1859

SUITABLE TRANSFORMATION, 34-
0357

SUM OF SQUARES, 40-0720
SUMS OF INDEPENDENT RANDOM
VARIABLES, 41-0329



SUPREMUM FUNCTIONAL, 40-1856
SYMMETRIC FUNCTIONS, 40-1857
SERVERAL TYPES OF, 40-1857
SYMMETRIC STABLE PROCESSES
CHARACTERIZATIONS OF, 40-
1859

SYMMETRICAL FACTORIAL
DESIGN, 40-0720, 40-0723,
40-2217, 40-2218

SYMMETRY

PROBLEM OF, 41-0329
SYSTEM IDLE TIME, 40-1856
SYSTEMATIC SAMPLING, 34-0357

T 

TCHEBYCHEFF POLYNOMIAL, 40-

1858

TECHNICAL REPORT, 40-0723
TELESCOPE PRINCIPLE, 40-1856
TERRY TEST, 34-0355
TEST CRITERIA, 40-0722
TEST STATISTIC, 40-0722, 40-
0723, 40-1858
TESTS OF HOMOGENEITY, 40-
0724

TESTS OF HYPOTHESES, 34-0358
TESTS OF MARGINAL
HOMOGENEITY, 40-0724
THEOREM

EXPECTATION IN, 41-0328
THEOREM CONTINUES, 34-0357
THEORY OF PROBABILITY, 34-
0355

TIME SPENT, 40-1856
TOLERANCE REGIONS, 34-0356,
34-0358

TOLERANCE-LIMIT PROBLEM, 34-
0355

TOPOLOGICAL ASSUMPTIONS, 33-
1480
TORONTO

UNIVERSITY OF, 40-2217
TOTAL OPERATION TIME, 40-
1856

TRANSFORMATION GROUP, 40-
2216

TRANSLATION PARAMETER, 40-

1859

TRANSLATION PARAMETERS OF
IDENTICAL SYMMETRIC
DENSITIES, 40-1859

TRIANGLE INEQUALITY, 40-0722

TRIPLET OF INDICES, 34-0355
TRUE PARAMETER, 33-1480
TRUE PARAMETER VALUE, 40-
2217

TSCHEBYSCHEFF'S INEQUALITY,
40-2218

TUNNEY* 40-2220 
TWO-DIMENSIONAL RANDOM WALK
REPRESENTATION, 40-1856
TWO-SAMPLE PROBLEM, 41-0329
TWO-SAMPLE RANK TESTS

POWER FUNCTIONS OF, 34-
0355

TWO-WAY CONTINGENCY TABLE,
40-0724

TYPE II, 40-2219

TYPES OF HISTOGRAMS, 40-1856

U 

UMP INVARIANT TEST, 40-0722
UNCONDITIONAL DISTRIBUTION
THEORY, 40-0721
UNEQUAL SAMPLE SIZES, 34-
0355

UNIFORM PRIOR, 40-1859
UNIFORM WEIGHTS, 40-0720
UNIVARIATE ANALYSIS OF
COVARIANCE, 40-0721
UNIVARIATE ONE-PARAMETER
FAMILY OF SAMPLING
DISTRIBUTIONS, 41-0328
UNIVERSITY OF TORONTO, 40-
2217

UPPER BOUND, 40-0723, 40-
2217






V 

VALIDITY OF ANALYSIS, 34-
0357
VALUES

SET OF, 40-0720
VAN DER WAERDEN-X, 41-0329
VARIANCE ESTIMATOR, 34-0357
VARIANCE IDENTITY

ANALYSIS OF, 33-1480
VARIANCE TESTS

ANALYSIS OF, 34-0357
VARIANCES

EQUALITY OF, 40-1858
HOMOGENEITY OF, 40-0722
VARIATE VALUES, 40-2219
VARIATES

NUMBER OF, 40-0721

SETS OF, 40-1858
VARIATION

COEFFICIENTS OF, 40-0723
VECTOR CASE, 40-0724
VERSION OF LEHMANN'S, 34-
0357

VIA PITMAN EFFICIENCY, 40-
1857
VIEW

POINT OF, 41-0328



WALD

CONDITIONS OF, 40-1857
WEAK ASSUMPTIONS, 33-1480
WEAK CONVERGENCE, 40-2217

APPLICATIONS OF, 40-2217
WEAK CONVERGENCE OF
STOCHASTIC PROCESSES, 40-
2217

WEAK SEQUENTIAL COMPACTNESS,
41-0330

WEAKER CONDITION, 40-2217
WEIBULL DISTRIBUTION, 33-
1502

WEIBULL PROBABILITY GRAPH
PAPER, 33-1502
WELL-KNOWN LACK OF
ROBUSTNESS, 40-1858
WELL-KNOWN SPACE, 41-0330
WIDTH CONFIDENCE INTERVAL,
40-1859

WIENER PROCESS, 40-1859
WILCOXON TEST, 34-0355, 41-
0329

WILKS

NULL DISTRIBUTIONS OF, 40-
0721



APPENDIX IV



AMALGAMATED ALGORITHMIC INDEX
ABSTRACTS IN CANCER RESEARCH










a 


BODY HORMONES, 65-1462


COMPLETE REGRESSION, 65-1418


■■ 




A 


BODY WEIGHT, 65-1433


COMPLICATION OF HORMONE






A 


BONE LESIONS, 65-1475


IMBALANCE, 65-1450


p 




01 


BONE METASTASES, 65-1464


CONCCHI TANT 






AHI AT [UN 


KLCALC 1 FlUATIUN uF , 65- 


TRANSOlAPHRAoMATIC ' 


EFFECT UF BILATERAL 




fliOM flft 


14 36 


unilateral AURENALLCTGMY. 


COFHURECTCMY. 65-1435 




Ar.CCSSinLL SUFT H S SUt 
i-FT, ICJNS» A 5-1 A 36 


uuNL MAIN 

RT L IFF IN, 6 5-1395 


65-1409 

CUNSlOEKAcJLh IMPROVEMENT . 


EFFECT OF HYPOPHYSECTOMY,
65-1404 




AUCUHAIF HISTOLCCIC 
.1IAGNJS£S» 65-1^0*; 


BONY METASTASES, 65-1445
BRAIN METASTASES, 65-1459


65-1418 

CONTROL GROUP, 65-1452


EFFECTIVE PALLIATIVE 
TREATMENT, 65-1392 




ACniSON'S OlStASfc SECCNDARY, 
6 5-U2A 


BREAST 

PALLIATION OF, 65-1424


CONTROL OF DIABETES
INSIPIDUS, 65-1485


EFFECTS OF ADRENALECTOMY,
65-1404 




ADRENAL CORTEX, 65-1462


BREAST CANCER, 65-1403, 65- 


COURSE OF DISEASE, 65-1405


EFFECTS OF BILATERAL 




ADRENAL I NSUf F I C I L'NCY * 65- 


1413# 65-1439. <,5-145U. 65- 


COURSE OF ESTROGENS PLUS 


_ ADRENAt Pr.TrPY . 6^,-1415 




1 AOA 

ADRENAL MAHRARY CANCER. 65- 


1460, 65-147J 

CASES OF, .65-1392. 65-1419 ■ 


RADICTHERAPY, 05-1445 
CUMULATIVE LCNGE^/HY. o5- 


EFFECTS CF HVPCPHYSECrCMY, 
65— l403» 66—1476 




ADOCNAl. SU'RdlOS. 6?-IA?A 


ChEHICAL HORMONAL CONTROL 
OF, 65-14H4 


1410 

IB 


EFFECTS CF SURGICAL, 65-1439 
FIGFTEFN PATIENTS. 64- J 410 




Anur-NALrCTO/lY 

t-rn.crs nr, (i5-w«c/. 


nUFASl CANCER AUKCN ALEC I UMY , 
(i5- 14 2D 


U 


ELUCrRULYlh hEHEUSlASIS, 05- 
14 77 




RLVUw OF, 65-1 j«>C 
USf OF# . 


UKLASl CANCLR MLIASIASES, 
65-1410 


_UAY ICTERUS, 05-1473 


ENDUCKINE ABLAIIONS, 65-1405 
CNUCCRlNr UCNTfUJL 




AO<- bNALtCTCHY ARREARS, 65- 
1 395 


BHI: AST CANC.f* NUIES, 65-1396 
BREAST CARCINiTKl. 65-1390, 


ULFINITU bcNtFlCIAL EFFECT, 
65-14 19 


HEIHGDS CF, 65-140/ 
ENDOCRINE OcPtNDENCfc OF 




AiJRC-NALECTCMY IN CANCERp 
AOR nNALCCTHMY HANIFESIS 


65-1461, 65-1464 


DEFJNIIE CLINICAL 
IMPROVEMENT, 65-1465 


lUMCKS, 65-146/ 
ENueCKiNE GIANOS 




ITSELF, 65-IA33 




GcoHbE OF PALLIATION, o5- 


KEMCVAL OF, 65-14q2 




a:;f 


1 


1404 


ENDlchim: r.t am:s he six 




YEAPS OF, 65-1395, 65-iAlO 
ALKAL INF -RHOSRHAT ASE- LFVhL . 


_ CANCck __ ^ 


CEoKct OF TFYRQIOAL 
OUPRCSSlOf.# 65-1485 


PATIENTS, 65-1424 
EMJrCkiNfc PAHEhN. 65-I4P4 




6 5-I<. TO 

ALltVlATlON OF RAIN. 65- 


ACRLNALLCTLMV IN, 
CANCERS 


DUGKlUS UF OBJECTIVE 
KEURESSIuN# 1.5-1423 


EAJCCRINE therapy, 65-1478 
ENDCCKINP THFRA'PV defers 




U99, 65-1^23 
A;*PHnPMIl. CELLS# 65-1 A2 a 


PtRCENT UF, 65-IH62 
CdRCrfJGMATOSI S 


JcMCNSIRAHLE lOMUR, o5-L41fl 
UbSIRABLt TREATMENT# 65-k479 


FCRM CF, 65-1499 
ESTROGEN 




ANTERIOR PITUITARY, 65-1462
ANTI-INFLAMMATORY ACTION,
65-1477

APPARENT EFFECT, 65-1405


CASES OF, 65-1465 
CAKCINDMATUUS PftUSTATE 

SLIGHT FEGKESSIUN OF, 65- 
1423 


OtSCXYCUKriCCSTilRUNt 
OUScS UF , 05-1418 
OESTRUlTION UF residual 
ANTERIUR LCEE TISSUE. 65- 


DROP IN, 65-1407 
ESTRCGEN ADM IN IS THAT ION , 65- 
14 1C 

ESTRCGEN DEPRIVATICN, 65- 




APPEKANCE CF PRIMARY, 65- 


CASA UI CUKA, 65-1392 


1463 


14 3o 




IA03 


CASE HISTORIES, 65-1476 


GcVCLOPHcNT OF MAMMARY 


ESTROGEN cXCRtTlUN, 65-1425^, 




ARREST 

EVIDENCE OF, 65-1A63 


CASE OF HVPERTHYRai DISM, o5- 
14 50 


CARCINOMA ONU, 65-1436 
UEVELOPMENT Ur MENINGITIS, 


65-1476 

ESTRCGEN EXCRET/GN FELL, 65- 




ARREST OF DISEASE, 65-1483


CASE CF PFOSTATIC CARCINOMA, 


65-1465 


14 35 




AVcP’DE CURAT ION, 65-1395 


65-1445 


OIAUU-TES INSIPIDUS, 65-l4b3# 


ESTROGEN LEVEL 




AVCF DURATION GF 

Rb'MI’.. . ON, 65-1463, 65-1 483 


■ CASUS CF EREAST CAiNCER? 65- 
1392, 65-1419 


o5-l 479 

CONTROL OF# 65-1485 


KUDUCIIUN IN, 65-1435 
ESTROGEN LEVELS, o5-l435 




AVERAGE LENGTH CF SLRVIVAL, 
65-1464 


CASES OF CARCINOMATOSIS, 65- 
1 465 


DIAGNOSTIC IHCRACUIUMY 
CCMDINATICN OF, 65-1409 


■ESIKOGEN SECRETION, o5-i435 
h.STHGGEN THERAPY# 65-1423, 




AVERAGE LONGEVITY, 65-1410
AVERAGE SURVIVAL, 65-1478


CASTRATION IN wOMbl'l, 65-1450 
CASTRATION THERAPY, 65-1485 


DIMUTIUN IN Size, 65-1390 
DISAPPEARANCE OF LOCAL, 65- 


65-1445 

ESTROGENIC SUBSTANCES 




AVERAGE SURVIVAL TIME, 65- 
1476 


CEREBRAL METASTASES, 65-1390 
CESSATIUN OF HEHUPIYSIS, 65- 


14 61 

m SAPPEARANCE OF PLEURAL 


IKPURJANCc OF, 05-1436 
estrcgens plus KAUICIHFRAPY 




n 


1390 

CESSATION OF PAJN, b5-l39D 


EFFUSIONS, 65-1390 
DISAPPEARANCE OF SOFT TISSUE 


CCtRSE or-tf 65-1445 
EriCLCGlCAL FACTOR, o5-l473 




D 


CHEMICAL FGRM(}NAL CONTROL OF 
SREAST CaNCER, 65-1484 


LESIONS, 65-1430 

discuntinuatiun of 


tVALUABLE INITIAL CBJECTIVE 
KEHISSIONS, 65-1484 




BEST RESPONSE, 65-1399
BEST RESPONSES, 65-1459 


CHIEF SIDE EFFECTS, 65-1463
CHIEF TOXIC EFFECT, 65-1479


TREATMENT, 65-1445 
DISEASE 


EVALUATION OF PALLIATIVE 
IREATHEM, 65—1408 




BILATERAL AOREN ALEC TCKI ES, 
65-1423 


CKORQICAL METASTASIS, 65- 

1443 


ARREST OF, 65-1483
COURSE OF, 65-1405


EVIDENCE OF ARREST, 65-1463
EViLENCe OF Re ACT I VAT ION, 




BILATERAL AORENALEC TCMY • 65- 
1403, 65-1408, 65-1419, 65- 


CLINICAL COURSE, o5-i435 
CLINICAL STATE, 65-1435, 05- 


DISTANT METASTASES, 65-I46L, 
65-1484 


05—1436 

EVIDENCE OF SPINAL 




14 33 ^ 

‘ EFFECTS OF, 65-1415 


1476 

CCKtUNATICM OF CIAGNOSTIC 


DOSES UF 

UESOXYCORT ICOSTERONL-# 65- - 


REGENEKAIiON, 65-1470 
EXCEPTION OF HVPUPhYScCTCMY . 




BILATERAL OOPHORECTOMY
EFFECT OF, 65-1435


TIIORACU rCMY, 65-1409 
COMPARATIVE STUDY, C5-1452 


1418 

DRAMATIC REGRESSIVE# o5-i423 


05-1399 

EXCRETE ESIROGEN# 65-1476 




BILATERAL TOTAL
ADRENALECTOMY, 65-1418


COMPLETE ALLEVIATION OF 
PAIN, 65-1458 


DROP IN ESTROGEN, 65-1407
DURATION OF REMISSIONS. o5- 






BLASTIC RESPONSE IN SKELETAL
LESIONS, 65-1485 


COMPLETE PAIN RELIEF, 65- 
1461 


1436 

DURATION OF SYMPTOMS, 65-










IN U»;inAKY t&LClL4, 6'j- 



y V uK i\7a L ’ Vi: s i‘UN s’t | <j'j - I 3 fc 
_i I M Y rji 5IJ NJ[ J 6 ^-J A(l1_ 

MVl ICNl S, 
y (if: .MU.rtTluNt 



5«AA0UYICN HYPCPhYSUCTUrtY, 
__6Ji.-l_fj.6Jl 



NOTtrtUKTHY I H>*KJVCHtNT , 
U6^ 



P^KlCOb OF KbHlSSICNf tb-- 



s/ 

rv 



NUCLEAR PYKNCSIbf 61 j-U2'i 
NUM t3, £R U K. K CM A S.S 
U19 



PERITONEAL METASTASES, 65-






PITUITARY STALK,

PITUITARY STALK SECTION, 65-



I Kf' J'- l.NOiir.^lKL IrtKAPY 
(• f _i; E'.1 1,. J: yj7 1. j :/J}. 

r ■ * IS .IP therapy"*, ’r/y - i o’ ' 

nuin H P s l > PAT Ji^Z 



KAl<LLiK:-<A hospital IN 
,„.S T Li.CK.Mj: 'll _ to 1 392 



i^as 

PLEURAL EFFUSIONS



1 



0 JfcCTIVE U C NhFITS, 65-I<i75 



CaJhCUve EVlUENCb, tob-1390 
uojEcrivh eviucncl of 



DISAPPEARANCE OF, 65-1390
PLEURAL HdTAST ASES, 65~14toA 



POST CPtKATIYt 
CQPPLICATIUNS. 65>1^Q^ 



IP 

va3 



PATH.MS, t; I- I A 35, 6 5- 

™Tl'TKr,~U5- 



LATTER l-ROCtiJUKE, 6 5- 1 62 
L ENG T HS„ 0 F_J iilE . 65-1^61 
Ic'S ICN 

F» 65-U 52 

Uucal 



IMPROVEMENT, o5-l'i7d 
OBJECTIVE IMPROVEMENT, 65-



1390, 65-1^36 



POST- menopause, 6*i“U0a 
PO SJr-C 



OEGRttS OF , o5-1^23 



CUNTkOL, 65-|J»6l 
PC ST-OPEHAT IVE P EH 1(1D,_6^-^ 





»■> » J -ri- .lot. U, J 

LOCAL 01 SEASL , 65-19U9 


1935 


J, qwi- 


1 — I I. H I 1.’^- 

19 7U 


t'..MN IN wr ICin. C5-I9 5d, 65- 


LOCAL PROSTATIC CANCER, 65-


OBJECTIVE 


Remission. 65-- 


POSTMENOPAUSAL WOMEN


19/5 

(lAjN Or Wl 1 i;MT, 65— 19d5 


197 5 

LONGEST SINGLE SURVIVAL, 65- 


1395, 65- 
19 35 , 65- 


I39ti, ^'05-1908, 65- 
•1971 


UCPHOKEoTGMY IN, 6 5-191*5 
PlSTMORTEM STUCIES. 65-1963 


G-HLIL^A SAPCOPA, 65-193.3 
CL AND ^>AR^:NC^•Y.•••A, 6 5-1963 

r. 1 • A li C C 1 f. 1 L C. A 1’ J ... 4 1 y. 4- i\ 


1 3SC 

LOW ADRENOCORTICAL ACTIVITY,


OdJCCT IVE 
OdJLCTIVc 


RESPONSE, 05-1922 
SIGNS Or 


PGSTOPEKAII Vt CARE* 65-lJVU 
PUS IC.PERAT IVE DfcATH, 65-1A19 



O^EVi ’snFSf, 65- 1 ^iC3 

GPUUPS iir »"*ATIINrS, 6 5-U5'2 
GKllWTiT HORPCNE , 65-1A /3 



LOW PRE -UFE RAT I. V u LEVcL S OF 
.UN IN ARY ESTKUGEN, 65-l\?6 



CCCU RKENCE O F HEMOLYS I S. 65- 



H 



73 

OlST RUGEN PKQJUCTION, to5- 



65-i ^oa 



35 

DUPHUHdLlUMV 



PUS luPEKATl VE SudSTlTUTIOM 
ni ERAPV. 



Hr POLYS I s 

nCf.iIRP LNCF riF ♦ 65-1 ^/3 



L; NANC L hose, 05"U,ti5 



HEMOPTYSIS

CESSATION OF, 65-1390



PAINTrUANCi: 1 HLKAPV , 65- 
IMd , 65- I A 79 



OVERALL EFELCr UH, 65-IA35 
C OPHURh CT LMY IN 



PKE 

state CF, ol.-L4nri_ 



PJSIHENUPALSAL hUMEN, to5“ 
l.Vi.5 



PrE-MENUPAoSAL cases, 65- 



MAJOR oENEFIT, 65-1^70 
MALE PATIENT, 6!i-lA76 



OPERATIVE MORTALITY, 65-
1390, 65-LA6y 



PkEHMINARY report, 05-U77 
PKb_^EJi UPAUSAL PAT1ENT_.__65- 



1^ 19 

PRuCPERATIVE BlDChfcMJCAL 



HfcPATIC I NVOLVL-PENT , o5-1922 
HEPATIC METASTASIS 



MALIGNANCIES

TREATMENT [ )r . o5- lAoi 



OPERATIVE TELhNiOUE. 
1- 3j90_,__o.5--1-9 63 



ASSESSMENT, 65-Wtal 
PRIMARY 



CONTRAI MilCAThS 
AQPFNAl ECTrmV. 



HIGH PRC-OPEUAT IVE ESTROGEN 
rYr.RrTTPM, <3S-TA76 



MALIGNANT TUMORS, 65-1959
MAMMARY CANCER, 65-1992



mammary CARCINUMa, a5-l907, 
65-1 5? 



OPPOSITE ADRENAL GLAND, 65-



OSSEOUS LESIONS
RAPID REPAIR OF, 65-1959



APPERANCE CF, 65-19U3 
PRIMARY ADDISON'S DISEASE,



65-1929 
PRI MARY AUkiNAL 



Hloh Pk lUK, 65“ 1923 
HISTOLOGICAL STUDY, 65-1929



HISTOLOGY OF TUMCR, 65-l9Cf3 

HISTORICAL INTRODUCTION,

HORMONAL REPLACEMENT THERAPY,
65- I4il9 



mammary CAkCINGMA UNG 

n LX^ L JPMEN T OF, o5-l9J6 



MASSES 

II, ?..£ _j) F, 6 5-1 3 <? IL. 

MAST rCTl.MY 

T IMt- or « 65-1^,52 



OSTEOLYTIC LESIONS
SCLEROSIS OF, 65-1390



OVARIAN CASTRATION

Therapeutic po 

CF, 05-193U 
GVcRALL Ei-F_LCT uF 



INSUFFICIENCY, 65-192^* 
Phi MARY uRPhTH. 65-l i9«J 



PRIMARY HYPOPHYSECTOMY, 65-
LlU_ 



Primary lesicn, 6S-i97o 

PRUULcM GF REHArilLiTAT 






HORMONE IMBALANCE

COMPLICATION OF, 65-1950



HORMONE SUPPRESSION, 65-1950
HORMONE THCKAPV, 65-195?. 

65-1A62 
HOR«(lNHS 



MEAN DURATION, 6y-l9U9 
_ME AN S_UiiXLVAj^.Q F ^PA_TiE.NTS^ _ 



OUPHCRtCTCMVe 65-1935 



65— 19 62 

65-1919 
ML-MNCi T I S 



o5-l9 Id 
PROGUCE MAf-TH TH YK TTROP 1 N , 



PAIaV- 



P 



65-1929 
PRCjg iL C- Cl I Min At hfaiFfit . 



o 5- I 962 

PH(iCLci:^CCMP£NSATurtY „ 



INFLUENCE OF, 65-1395
HUM ON TUMORS, 6 S- 1929 



HUMANS 

PROSTATE IN, 65-1962 



HYPERTHYROIDISM
CASE OF, 65-1950 
HYPOPHYSEAL STALK, 65-I9d3 
HYPOPHYSECTOMY



OEVELuPMENT of, o5— 19o5 

JLU±0 P A U 5 A L G R G UX,_i^5^9 J5_ 

MENOPAUSAL PATIENTS, 65-1959
METASTATIC BREAST CANCER,



ALLEVIATION OF
6 ■ “1.9.23, 



£>5-1399, 



EFFECT OF, 65-1969 
EFFECTS OF, 65-1963, 65- 



65- 1390, 65-1905, 65-19ld, 

66- 1922, 66-1935, 65-19 6d, 
65-1976 

METASTATIC BREAST CARCINOMA,



CESSATION OF, 65-1390
COMPLETE ALLEVIATION OF,



HYPOPHYSEAL OVERACTIVITY,
99- l„^9 



65“195d 
PARTIAL RELIEF OF, 65-1903
RELIEF OF, 65-1hIB, o5- 
U75 



PROGRESSIVE METASTATIC
UdfASI^ CANC ER. 65-U93 . 65- 



1976 

EXCEPTION OF, 65-1399



65-l9o3, 05-1977 
METASTATIC CANCER, 



65-1919 



METASTATIC CARCINOMA,
1929 



PALLIATION 
DEGREE OF,



1906 

PROLONGATION OF SURVIVAL,
65-1931 

PkUFPT RELIEF. 05-19TQ 



65-1909 



PROSPECT OF PALLIATION, 65-
1399 



PROSPECT OF, 65-1399
PALLIATION OF BREAST, 65-



PROSTATE CANCER, 65-1975
PROSTATE GLAND, 65-1931



VALUE OF, 65-1960



METASTATIC DEPOSITS

REGENERATION IN, 65-1970



METASTATIC GROWTH

REGRESSION OF, 65-1959



1929 

PALLIATIVE BENEFITS,

PALLIATIVE TREATMENT, 65-
1979



65-1 961 



PROSTATE IN HUMANS, 65-1962 
PROSTA TE tum or; 65- 1.9_9 6_ 



PROSTATIC CANCER, 65-1913,
65-1915, 65-1962



IMMEDIATE PAIN RELIEF, 65-



METASTATIC INHIBITION, 65-
1953



EVALUATION OF, 65-1408
PARTIAL RELIEF OF PAIN, 65- 



PROSTATIC CARCINOMA
CASE OF, 65-1995



19 7C 

IMMEDIATE POSTOPERATIVE



METASTATIC LESIONS, 65-1390,
65-1995 



1903 

PATIENT FOUR YEARS, 



PROSTATIC CARCINOMAS, 65-
1929



PERIOD, 65-1936
IMPORTANCE OF ESTROGENIC 


METASTATIC MAMMARY
CARCINOMA, 65-19B3, 65-1985 


PATIENTS 

GROUPS OF, 65-1952


D 


SUBSTANCES, 65-1936 
IMPROVEMENT


METHODS OF ENDOCRINE
CONTROL, 65-1967


MEAN SURVIVAL OF, 65-1962
NORMAL LEVELS IN, 65-1929


K 


OBJECTIVE EVIDENCE OF, 65-
19 7B 


MONTH PERIOD, 65-1390 
MONTHS IN RELATIVE COMFORT,


PERCENTAGE OF, 65-1392
SELcCnUN CF, O5-l90d 


RADICAL MASltCTUHY AG, 6>- 
19 52 - 


IMPROVEMENT OF VISUAL 
FIELDS, 65-1390 


65-1971 


SERIES OF, 65-1913
VISUAL FIELDS IN, 65-1390


KAUlCACTIVt OULD, o5-l39o 
RADIOACTIVE GOLD SEEDS, 65-


INCREASES APPETITE, 65-1390 
INDICATIONS OF SUBJECTIVE 


M 


PATIENTS ALIVE, 65-1962
PATIENTS ESTROGEN EXCRETION, 


1963 

RADIOACTIVE YTTRIUM, 65-1965


RESPONSE, 65-1390
INDIVIDUAL PROBLEM, 65-1918


Bl 

NEOPLASM 


65-1970 

PATIENTS UNFIT, 65-1396


RAPID REPAIR OF OSSEOUS
LESIONS, 65-1959 


INFLUENCE OF HORMONES, 65- 
1395 


REGRESSION OF, 65-1985
NODE INVOLVEMENT, 65-1952


PELVIC BONE LESIONS, 65-1903
PELVIC COMPLICATIONS NODE


RAPID SKELETAL INVOLVEMENT,
65-1905 


INHIBITION OF TUMOR GROWTH
PREFACES, 65-1395 


NORMAL ACTIVITY, 65-1922, 
65-1985 


INVOLVEMENT, 65-1952 
PELVIC REGION, 65-1959


RAPID TREATMENT, 65-1909
REACTIVATION


INTERSTITIAL IRRADIATION,
65-1396 


NORMAL LEVELS IN PATIENTS,
65-1929 


PERCENT OF CANCERS, 65-1962
PERCENT OF THREE-MONTH


EVIDENCE OF, 65-1936
KEC ALC IFICAT lUN CF SCNt 


IC LESIONS, 6 5- 


NORMAL LIFE, 65-1965

NORMALIZATION OF


SURVIVORS, 65-1902 
PERCENTAGE OF PATIENTS, 65-


HtTASTASES, o5'-l93o 
RECURRENT BREAST CANCER, 65-




TEMPERATURE, 65-1959


1392 


19 69, 














RECURRENT METASTATIC
CARCINOMA, 65-1909




SOFT TISSUE INVASION, 65-
IhO'j 


TREATMENT PREFACE, 65-1936 
TUMOR 


t . 1 U. C ’l iVl N IN CS f R*C6 ON 


“Lt vTui 


SOFT TISSUE LESIONS, 65-1909

DISAPPEARANCE OF, 65-1936


HISTULUUY OF, 65-l90d 
REGRESSION OF, 65-1923


K • 1 1 u rril:N' 1 cTT 

6 5-1909 




SPINAL CORD METASTASES, 65-
1910


TUMCK CcLLS, o5 — Ih3J 
TUMCR CEVFLCPMeNT , ob-liJ3 


R; i.E.M.RAr lUN IN MI ASTATIC 


SPIimAL RhucNCKAfluN 


TJMCR GRChTH, o5-l9cil, 65- 


J‘ ^’C.SIJ S,_C-.5rl 9?C 




f.VlLENCt OF, 63-19/0 


1 9_0_9 


\ A ( 3'i A L L Y P..P i f "nG D L S ' 


dtcfARh" 


SPlLLN-ADnENAL 


. TUMCR uHUrtTh Prefaces 


I FPALPAilLL, 65-19 7C 




VENOVENOSTOMY, 65-1907


INHIBITION OF, 65-1395


Kt i;p|; S GN 




STATE CF FKtr 65-19Cd 


TUMOR IT SELF 1 o5-l9 73 


nilJcCTlYh SIGNS (tF, 


6 5- 


STEROID HORMONES, 65-1929


TUMOR SIZE, 65-1907



1 •. t j 

— rSStrN fU .MIT ASIA u c 

_ _K| . . f S SION (T NbCP LJ SJ' t _o 
"i'.W" 

u><i r; t ii:n >:r- sk i n 



STtKOlU MtTAUOLlSMi 6!>-UlU 
^S.T I Lf C>_T j?,C J.R E AIME NT , 6 5- 

STC L(<mC'*1 

k‘a I< C L ii A S K A h U b i T A L Vn p 
6^-1 \U? 



TUMORS

J: ^JULf<^^lE OEPfcNJtN Cb OF, 

6b“U6 7 



\} 



■'I l I A !% f A b (. S t <j ‘j — I ^ b k 
jM . i i I. ^ bjj liV n< J_LHUK ,__6 W <1 7 3_ 
fV,7.7i't;TL ITAT iU.\ 

CF. 6b~KUl 



SUtiJi-UJv/L iii'.NtHWj, b-j-UO^P 
_SpU JuCJ I V L J.':' P.'iC.y.^.''^LNT i_657__ 
*"Vjvdp* o‘‘3-'l>^V, “d^-YjVop’^bb- 
I'.rjl, 65-l^i>4, 65~ 



kH.AIIVG CC«»*lJR 1 
t^CMHS IKp 65-U7t 



U6^i, >jt >- tit 65- 
bUIUrCTIVt: kuSPUNSEi 



__y WiJ c H 6 Ui'^.C_.^LH.uN A LLCTUMl tii I 

ab- 1 4 1 3 I 6 5- 

. UNIL ATb RAL AURENALECTQHYp 



6b“lA0/ 

UNINARV ESTR6ULN- 



•-GUL*- tr; nrNH pain, 05 - 13^5 

HU.lEf Ot- PhlUt t-S-lAlOp 65- 
U75 

AVtKAUl aUR.M I UN CF, 65- 
65-IAJ3 



LCrt PKfc-OPLRATIVE LtVELS 
. 5 - Lft.7.6. ^ 



Im:1CAT ions UF, 05-U90 
SUUSLfjUEi^l Ubi UMI bA INt 65- 
IA70 UKINAKV CALCIUM 

_SLCCF,S.SLVk .ATTEMPTS f>5-l^lp FALL_IN, 6i;-rl^<35 

SUKbiCAL Ub£ UF AOKtNAltCTJHY , 

hPFLCTS nr. 6b-L^3>j U 19 



FLM I*H)S UT I 
M[ .MiSSKiNS 



6 5- lA 59 



JURAT! UN OF, 65-U3c 
NUMHLR fir t 65-1A19 



SUKuiCAL CASfKATluAI, 65“ 
lAjiui o5-lA*t5p u5-lAS^ 



SURUCAL MURTAHrV, 65-UOd 
SURGICAL TRAUMA



V 



REMOVAL OF ENDOCRINE GLANDS,
h5-lAS2 ^ 



RLUUCTECN UF, 63-IA09 
9MRICAL TfcCHNiQUESf 



VAILE Or HVPCPHYbtC TCMY, 65- 
lAdO 



»LblCUAL ANTEHILH LCric 
M SSJE • 



UtSTRUCTICN OFt 65-1963 
SPCNSIVE N'f-UPLASriS, 6 5- 



SURVIVAL

AVERAGE LENGTH OF, 65-1969



PROLONGATION OF, 65-1931
SURVIVAL PATES P 6 5- U52 



VENOUS HEMORRHAGE, 65-1909
_J^L;^C_ EhAL_ME L AJij: ASES. 05-1399 



VISION

TEMPORARY DISTURBANCES OF,



19 1C 

TINAL CETACFMFNT* 



65- 199 3 



SURVIVAL OATES EXCEPT, d5- 
-i9;>^ 



65-1971 
VISUAL FIELDS



cVIEW OF ACKENALECTOMY , 
1 596 



SYMPTOMS

DURATION OF, 65-1952 



IMPROVEMENT OF, 65-1390
VISUAL FIELDS IN PATIENTS,



ISK APPEARS, 65-1965 
ENTUEN TFEKAPY, 65-19^^3 



65-1390 



iLE OF THYPOIIJ 
939 



TIILPAPY, 65- 



T AdULAR FCKM: 
TEMP FHA T URE 



M 



SCLEROSIS OF OSTEOLYTIC
LESIONS, 65-1390



NORMALIZATION OF, 65-1959
TEMPORARY DISTURBANCES OF



ELKS_ LAliiH. 0 5-L><U-9^ 



VISION, 65-1971 
TEMPORARY REMISSION, 65-1929



^E1 UHT 

_G A I N LN, 65-I 956«_65-19_73 



GAIN Ur, 65-l^d5 
■ WELGHT GAJjN_,^j^J95. 65- 



SECONDARY GROWTHS, 65-1903
SELFCTIflN (IF PATIENTS, 65 - 



19 00 

SERIES OF PATIENTS, 65-1913



TEN PATIENTS, o5-l*t22, 65- 
19 Th 



TESTOSTERONE TREATMENT, 65-
I90d 



1903 

■VtOLFFE K _hYPCPHY S l:CtUMY. 
1961 

kHOM 



6.5- 



SERUM ACID PHOSPHATASE, 65-
1923 

SEVEN OPERATIVE DEATHS, 65-
19 6 9 



ScVtNTY-MNF PATIENTS, 65- 
.1390 



THEORETICAL CONSIDERATIONS,
19 J6 

THERAPEUTIC POSSIBILITIES OF
OVARIAN CASTRATION, 65-193B



SEVERE GASTRIC DISTURBANCES,
09-19 71 



THERAPEUTIC PURPOSES, 65-
1922 



FOUR OF, 65-1962
WIDESPREAD METASTASES, 65-
19 73, 65- 19 7o 

WIDESPREAD METASTATIC BREAST



CANCER, 65-1962 
WOMEN



THERAPEUTIC VALUE,
THERAPY



65-1395 



CASTRATION IN, 65-1450
WORTHWHILE PALLIATION, 65-



SHORT DURATION, 65-1979
SHORT SURVIVAL, 65-1975



SHORT TIME, 65-l97d 
SIDE EFFECTS, 65-1971



FLRM UF , 65-19‘3d 
THEREFORE ADRENALECTOMY,



SIMPLE ENDOCRINE PROCEDURES,
6 5— 1 9 u 2 



1392 

THIRD PATIENT, 65-1916, 65-



197U, 65-1175 

THIRTY-THREE PATIENTS, 65-



¥ 

-I- 



YEAR LATER, 65-1909 



SIX CASES, 65-1970 

SIX PATIENTS, 05-UC3, 65- 



1905, 65-1903 
ENDOCRINE GLANDS OF, 65-



19 69 
THOUGHT 



iCUTH, 05-l9d5 



THREE-MONTH SURVIVORS
PERCENT OF, 65-1962



YEARS LATER, oo-lilO 
YEARS OF AGE, 65-1395, 65-



19 10 



1929 


THYROID EXTRACT, 65-1969


rnuR or, 65-i9t5 


THYROID THERAPY


SIX POST OPERATIVE


ROLE OF, 65-1939


mortalities, 65-19C3 


THYROIDAL DEPRESSION,


SIZE

DIMINUTION IN, 65-1390


DEGREE CF, 65-L9d5 
TIME


SIZE OF LESION, 65-1952


LENGTHS OF, 65-1961


SIZE OF MASSES, 65-1390 


TIME OF MASTECTOMY, 65-1952


SKELETAL LESIONS

BLASTIC RESPONSE IN, 65-


TOTAL HYPOPHYSECTOMY, 65-

19 79 1 


19D5 

SKELETAL METASTASES, 65-1399


TRANSIENT DIABETES, 65-1971
TRANSIENT SUBJECTIVE


SKIN METASTASES

REGRESSION OF, 65-1953


IMPROVEMENT, 65-1922
TREATMENT


SLIGHT REGRESSION OF
CARCINOMATOUS PROSTATE, 65-


DISCONTINUATION OF, 65- 
1995 



1 A 5 




■ETAL REPLACEMENT , 



TREATMENT OF MALIGNANCIES,
65-1961 



